<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Stephan Miller</title>
    <description>Kansas City Software Engineer and Writer</description>
    <link>https://www.stephanmiller.com/</link>
    <atom:link href="https://www.stephanmiller.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 18 May 2026 20:51:39 -0500</pubDate>
    <lastBuildDate>Mon, 18 May 2026 20:51:39 -0500</lastBuildDate>
    <generator>Jekyll v4.2.2</generator>
    
      <item>
        <title>Building a Cost-Saving Agent Skill That Accidentally Became Its Own Weekly Blog Post</title>
        <description>&lt;p&gt;I had a vault note from a few weeks before this all came to a head. It said, in my own voice and barely punctuated, &lt;em&gt;“I really need to figure out openrouter.”&lt;/em&gt; Past me wrote that and moved on. Past me did not yet know that the cost of figuring it out the wrong way is fifteen dollars per coffee break.&lt;/p&gt;

&lt;p&gt;Here is what happened. I have a Pro Claude subscription. I love it. I rarely hit the weekly token cap. What kills me is the session timer: every weekend I have a more hours to actually code, I burn through one or two session windows fast, and I’m out. So I started looking around (opencode, Pi, OpenRouter) for a “swap in when Claude rate limits me” alternative or when I decide to let loose a bunch of agents on a project&lt;/p&gt;

&lt;p&gt;That was a fine idea right up until I picked an unfamiliar model on OpenRouter, started a coding session, walked away to grab coffee, came back fifteen minutes later, and watched my OpenRouter dashboard tell me I’d just spent fifteen dollars. Fifteen bucks isn’t a fortune. &lt;em&gt;Fifteen bucks per coffee break&lt;/em&gt; would add up pretty quick though. I sat there staring at the screen and thought: there are fifteen bazillion models on OpenRouter, and trial-and-error is a gambling problem.&lt;/p&gt;

&lt;p&gt;So I built a thing. This is the story of that thing. It’s also the story of how the thing I built to save money turned into a weekly blog post which I post in my &lt;a href=&quot;/category/large-language-models/&quot;&gt;Large Language Models category&lt;/a&gt;.&lt;/p&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#the-session-time-trap&quot; id=&quot;markdown-toc-the-session-time-trap&quot;&gt;The Session Time Trap&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#15-in-fifteen-minutes&quot; id=&quot;markdown-toc-15-in-fifteen-minutes&quot;&gt;$15 in Fifteen Minutes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-skill-at-the-shape-level&quot; id=&quot;markdown-toc-the-skill-at-the-shape-level&quot;&gt;The Skill: At the Shape Level&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-adaptation-log-the-idea-that-makes-it-actually-work&quot; id=&quot;markdown-toc-the-adaptation-log-the-idea-that-makes-it-actually-work&quot;&gt;The Adaptation Log: the Idea That Makes It Actually Work&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#four-weeks-of-evolution&quot; id=&quot;markdown-toc-four-weeks-of-evolution&quot;&gt;Four Weeks of Evolution&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-recursive-twist&quot; id=&quot;markdown-toc-the-recursive-twist&quot;&gt;The Recursive Twist&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-vault-is-the-substrate&quot; id=&quot;markdown-toc-the-vault-is-the-substrate&quot;&gt;The Vault Is the Substrate&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-receipts&quot; id=&quot;markdown-toc-the-receipts&quot;&gt;The Receipts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-session-time-trap&quot;&gt;The Session Time Trap&lt;/h2&gt;

&lt;p&gt;For me, the weekly Claude cap in Pro is generous enough that I rarely brush it (for now, but that’s going to change). The session timer, on the other hand, ends my Saturday afternoon while I’m still in the middle of my work.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-1.jpg&quot; alt=&quot;The Session Time Trap&quot; srcset=&quot;            /assets/resized/480/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-1.jpg 480w,            /assets/resized/800/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-1.jpg 800w,            /assets/resized/1400/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-1.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can extend a session by turning on extra usage and paying by token. I’ve done that and I usually use that to finish whatever I’m working on, so I can stop using it for the day. But with open source models catching up to frontier models, I started wondering if I was limiting myself.&lt;/p&gt;

&lt;p&gt;This is why I started looking at OpenRouter in the first place. Not to leave Claude. Just to have a parallel rail I could swap onto when the session timer killed me mid-bug-hunt. The hope was: never run out of session time again, just route to whatever model can keep going.&lt;/p&gt;

&lt;p&gt;The problem was that I had no idea which model to route to. I’d open the OpenRouter rankings, see a hundred names I didn’t recognize, click one that sounded reasonable, and ship it.&lt;/p&gt;

&lt;h2 id=&quot;15-in-fifteen-minutes&quot;&gt;$15 in Fifteen Minutes&lt;/h2&gt;

&lt;p&gt;I’m not going to name the specific model. It wasn’t entirely the model’s fault. It was mine. I picked it because it was &lt;em&gt;cheap&lt;/em&gt;: I was being smart about costs, I told myself. I scrolled the OpenRouter listings, saw the pricing, thought “that’s a fraction of what Claude charges,” and picked it.&lt;/p&gt;

&lt;p&gt;What I did not verify was whether it could actually handle tool calls correctly. It could not. Instead of completing a task and stopping, it looped. Called the same tools again. Got confused by the results. Called them again. It wasn’t reasoning: it was a stuck record that happened to cost money per revolution. Fifteen minutes of that loop, fifteen dollars out the door, and I’m wondering what just happened.&lt;/p&gt;

&lt;p&gt;That was the moment I realized the thing I needed wasn’t another cheap model to test. The thing I needed was &lt;em&gt;a system&lt;/em&gt;. Something that ran on the rotating model landscape, cross-referenced what was actually working, including which cheap models were actually reliable, and produced a reading I could trust without spending an afternoon on it. Because cheap and broken is more expensive than expensive and correct. Static instructions weren’t going to cut it either: the model space changes weekly. Hard-coding “use this one” would rot before I read this sentence back.&lt;/p&gt;

&lt;p&gt;I needed a Claude Code skill.&lt;/p&gt;

&lt;h2 id=&quot;the-skill-at-the-shape-level&quot;&gt;The Skill: At the Shape Level&lt;/h2&gt;

&lt;p&gt;Here’s what it does. I’m going to describe the &lt;em&gt;shape&lt;/em&gt; and not the &lt;em&gt;recipe&lt;/em&gt;. The whole point of this kind of tooling is that you tune it to your tasks, your price tolerance, your stack, your willingness to test sketchy Chinese models on production code. If I posted my prompts and you ran them, you’d get something generic and so would I. The value is in the tuning. So: shape, not recipe.&lt;/p&gt;

&lt;p&gt;The skill cross-references three different &lt;em&gt;kinds&lt;/em&gt; of signal:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Volume signal&lt;/strong&gt;: where the actual money is flowing. Token counts on a public router platform. This tells you what people are &lt;em&gt;trying&lt;/em&gt;, but not whether they kept using it.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Head-to-head signal&lt;/strong&gt;: which model wins when two anonymous outputs are placed side by side and a human picks. This tells you what people &lt;em&gt;prefer&lt;/em&gt; in voting conditions, but not what they’re using day to day.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Lived-experience signal&lt;/strong&gt;: what people say after using a model for weeks. Specific projects, specific failures, specific switching stories. This tells you what’s &lt;em&gt;actually working&lt;/em&gt;, but it’s loud-minority biased and slow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one of those sources lies in its own way. The truth lives in the intersection. The skill’s whole job is to assemble the intersection into a five-minute brief that I read before I open OpenRouter on a weekend.&lt;/p&gt;

&lt;p&gt;The brief lands in my Obsidian vault. It has trending movers, category-by-category breakdowns, hype-vs-value analysis, horror stories, upcoming releases. I read it Saturday morning. Then I open OpenRouter and I know which slug to type. That’s it.&lt;/p&gt;

&lt;h2 id=&quot;the-adaptation-log-the-idea-that-makes-it-actually-work&quot;&gt;The Adaptation Log: the Idea That Makes It Actually Work&lt;/h2&gt;

&lt;p&gt;Here is the architectural lesson worth taking away even if you don’t build a model-buzz skill specifically. It took me about two weeks of running the skill to figure this out, and once I did, the whole thing got dramatically better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static instructions rot.&lt;/strong&gt; Especially in a domain that changes weekly. The skill I wrote in week one had assumptions about which sources were accessible, which categories existed, which org subreddits were active, which providers were stealth-launching models. Half of those assumptions were obsolete by week three. If I’d just kept editing the main prompt file every week to fix what I noticed, I’d have a Frankenstein-prompt by month’s end and no memory of why any specific line was there.&lt;/p&gt;

&lt;p&gt;What works is making the skill take notes to itself. After every run, the skill appends to a small log file: things observed, things that broke, patterns worth carrying forward, false patterns to avoid. The next run reads that log first, before doing anything else, so it walks in already smarter than the version that ran a week ago.&lt;/p&gt;

&lt;p&gt;A few examples of the &lt;em&gt;kind&lt;/em&gt; of thing that ends up in the log, abstracted enough that they’re useful but not specific enough to be a recipe:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-3.jpg&quot; alt=&quot;The Adaptation Log — the Idea That Makes It Actually Work&quot; srcset=&quot;            /assets/resized/480/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-3.jpg 480w,            /assets/resized/800/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-3.jpg 800w,            /assets/resized/1400/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-3.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Some signals look like adoption but are actually marketing stunts.&lt;/em&gt; When a brand-new model spikes massively in volume during a free promotional window that expires next week, that volume isn’t telling you what it looks like it’s telling you. The skill learned to identify this pattern after the second time it almost led with a marketing-stunt headline.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Some sources stop working without telling you.&lt;/em&gt; A subreddit gets locked. An API starts returning 403s. A tool’s “trending” tab quietly changes its sort order. The log records what to fall back to.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Some patterns repeat with a delay.&lt;/em&gt; When two big labs ship competing models on the same day, they’re timing each other to split the news cycle. The skill now knows to look for the second drop instead of treating the first as the only story.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the kinds of patterns that don’t survive in static instructions. They survive in a log that gets re-read every run. The adaptation log is the difference between a tool that gets dumber every week and a tool that gets smarter.&lt;/p&gt;

&lt;p&gt;If you build any AI workflow that has to operate in a domain that changes faster than your prompts, this is the architectural pattern. Static instructions plus a self-edited operational log. The log is small, but the log is everything.&lt;/p&gt;

&lt;h2 id=&quot;four-weeks-of-evolution&quot;&gt;Four Weeks of Evolution&lt;/h2&gt;

&lt;p&gt;Here’s the actual play-by-play of how this thing has changed since I built it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week one.&lt;/strong&gt; First run. I was just trying to get the data without losing my mind. Reddit blocked me on day one. I had to teach the skill to use aggregator sites instead of direct Reddit access. Wrote that down in the log. Already, week one, the skill was learning things I hadn’t anticipated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week two.&lt;/strong&gt; Two big labs shipped major models on the same day. The skill’s instructions handled “one model launches, here’s how to cover it” but not “two simultaneous launches that are partially competing for the same coverage slot.” I had to teach it to compare the two announcements rather than just covering them in sequence. Wrote that down too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week three.&lt;/strong&gt; First fake spike. &lt;a href=&quot;https://www.stephanmiller.com/model-roundup-w18-the-free-countdown-the-300-amnesiac-and-the-quiet-climber-at-7/&quot;&gt;A major vendor gave a new model away for free, the rankings rocketed by quadruple-digit percentages&lt;/a&gt;, and the entire signal was meaningless. The skill nearly led with the spike as the week’s headline before I caught it and made it rework the section. The log gained a new pattern: &lt;em&gt;free-period distortion&lt;/em&gt;. Future runs will detect it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week four.&lt;/strong&gt; The big pivot. I was doing the &lt;a href=&quot;https://www.stephanmiller.com/the-cheapskates-guide-to-the-arena-leaderboard-why-i-stopped-paying-claude-opus-prices/&quot;&gt;cheapskate math&lt;/a&gt; for the third Sunday in a row and noticed something structural: the head-to-head leaderboard at the top is compressed. The entire visible top end fits inside a tiny rating spread. The prices fan out by an order of magnitude or more. So the question every week wasn’t really “what’s best”, but “what’s the cheapest option that’s still in striking distance of best, by category.” I wrote a methodology for it, wired it into the skill as the new centerpiece, and now the brief leads with this every week. The skill is meaningfully different from what it was three weeks ago, and it’ll be different again next week.&lt;/p&gt;

&lt;p&gt;The pattern across all four weeks: every week I find some piece of the analysis I’m doing by hand, and I move it into the skill. The skill is a running record of &lt;em&gt;what should be automated next&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;the-recursive-twist&quot;&gt;The Recursive Twist&lt;/h2&gt;

&lt;p&gt;I built this to save money on AI usage. That’s still what it does. I haven’t burned fifteen dollars in five minutes since I started using it. Mission accomplished.&lt;/p&gt;

&lt;p&gt;What I did not anticipate is that the brief the skill produces is also a perfectly fine weekly blog post. The post drives traffic. The traffic justifies more time on the skill. The cycle compounds.&lt;/p&gt;

&lt;p&gt;I did not plan any of this. It just happened. There is something structurally different about tools that produce content versus tools that just save you time. Tools that save time give you a quieter afternoon. Tools that produce content have an exhaust pipe and once you notice the exhaust pipe, you start aiming it at things that need promoting. The promotional material is free, because it was always going to get produced. The only question was where it was going to go.&lt;/p&gt;

&lt;p&gt;The meta-twist that gets me every time: this blog post you’re reading right now exists &lt;em&gt;because the skill exists&lt;/em&gt;. The skill produced its own promotional material this week. I am writing a post about a thing that wrote a post about itself. I’m not entirely sure what to do with that information except keep going.&lt;/p&gt;

&lt;h2 id=&quot;the-vault-is-the-substrate&quot;&gt;The Vault Is the Substrate&lt;/h2&gt;

&lt;p&gt;The arrangement looks like this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The skill produces a brief in my Obsidian vault every Saturday morning&lt;/li&gt;
  &lt;li&gt;The brief gets handed to another skill, which produces a draft post&lt;/li&gt;
  &lt;li&gt;The draft gets fact-checked, edited, and shipped to Jekyll&lt;/li&gt;
  &lt;li&gt;The shipped post becomes a backlink the &lt;em&gt;next&lt;/em&gt; week’s brief reads as prior context, so the skill knows which stories are already covered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-5.jpg&quot; alt=&quot;The Vault Is the Substrate&quot; srcset=&quot;            /assets/resized/480/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-5.jpg 480w,            /assets/resized/800/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-5.jpg 800w,            /assets/resized/1400/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter-body-5.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The vault is the substrate. The skill is the engine. The blog is the surface. Each layer leaves traces the others read. None of this is a content management system. It’s a notebook with skills attached to it.&lt;/p&gt;

&lt;p&gt;This is what &lt;a href=&quot;https://www.stephanmiller.com/the-great-vibe-coding-experiment/&quot;&gt;vibe coding looks like for &lt;em&gt;content&lt;/em&gt; instead of code&lt;/a&gt;. Build a thing, watch it tell you what to build next, iterate: but the artifacts are paragraphs instead of pull requests. Same kind of accidental compounding. Same need to keep an adaptation log so you don’t end up with a stack of stale tooling that you have to keep fighting.&lt;/p&gt;

&lt;h2 id=&quot;the-receipts&quot;&gt;The Receipts&lt;/h2&gt;

&lt;p&gt;I haven’t accidentally burned fifteen dollars in fifteen minutes since the skill went live. I now spend more time on the skill than the skill saves me, but in the spending I get a weekly blog post and a continuously-updated map of the model landscape that I trust enough to use. The math, in dollar terms, is fine. The math, in time, is also fine because the time produces content.&lt;/p&gt;

&lt;p&gt;The skill is going to be different by next week. I’ll find a new pattern, write a new adaptation note, refactor a section of the methodology.&lt;/p&gt;

&lt;p&gt;Meanwhile, while finishing this up, I noticed two more patterns that should be in the adaptation log. When a major free-period model retires its free tier, and the rankings haven’t normalized yet, there’s a “transition week” pattern the skill doesn’t handle gracefully. And the stealth slot on the rankings has been quietly sitting on a codename for over a week without resolution, which the skill currently treats as “must resolve within 48 hours” — that assumption needs to relax. So I’m going to add notes for both of those, kick off the next run, and we’ll see what shows up Saturday.&lt;/p&gt;

&lt;p&gt;You can find those posts in my &lt;a href=&quot;/category/large-language-models/&quot;&gt;Large Language Models category&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Mon, 18 May 2026 07:00:00 -0500</pubDate>
        <link>https://www.stephanmiller.com/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/building-a-cost-saving-skill-that-accidentally-became-its-own-newsletter/</guid>
        
        
        <category>large-language-models</category>
        
      </item>
    
      <item>
        <title>I Was Wrong About Hy3 (And Other Things I Learned This Week)</title>
        <description>&lt;p&gt;Two weeks ago I told you Tencent’s Hy3 Preview was a marketing stunt. Free until May 8, +1,356% on OpenRouter, “free + new + expiring = noise.” I was extremely confident about it.&lt;/p&gt;

&lt;p&gt;Today, one week into paid pricing, Hy3 is the #1 model on OpenRouter by tokens. 2.76 trillion of them, in a single week. The pattern detector got overconfident, and it turns out that’s the theme of the whole week. Cheapskate Picks held mostly steady but Math flipped one week after I published it. Claude Code’s billing bug entered its third week unpatched. And Google I/O is Tuesday, so whatever I write here is going to look quaint by Wednesday morning.&lt;/p&gt;

&lt;p&gt;Let’s get into it.&lt;/p&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#what-i-got-wrong-about-hy3-and-the-other-new-players&quot; id=&quot;markdown-toc-what-i-got-wrong-about-hy3-and-the-other-new-players&quot;&gt;What I Got Wrong About Hy3 (And the Other New Players)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-cheapskate-picks-held-mostly&quot; id=&quot;markdown-toc-the-cheapskate-picks-held-mostly&quot;&gt;The Cheapskate Picks Held (Mostly)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#hype-vs-value-ring-26-vs-ernie-51&quot; id=&quot;markdown-toc-hype-vs-value-ring-26-vs-ernie-51&quot;&gt;Hype vs. Value: Ring 2.6 vs. Ernie 5.1&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#claude-codes-billing-bug-enters-its-third-week&quot; id=&quot;markdown-toc-claude-codes-billing-bug-enters-its-third-week&quot;&gt;Claude Code’s Billing Bug Enters Its Third Week&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#whats-worth-trying-this-week&quot; id=&quot;markdown-toc-whats-worth-trying-this-week&quot;&gt;What’s Worth Trying This Week&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#tuesday-is-going-to-be-loud&quot; id=&quot;markdown-toc-tuesday-is-going-to-be-loud&quot;&gt;Tuesday Is Going to Be Loud&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#what-im-watching-next-week&quot; id=&quot;markdown-toc-what-im-watching-next-week&quot;&gt;What I’m Watching Next Week&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;what-i-got-wrong-about-hy3-and-the-other-new-players&quot;&gt;What I Got Wrong About Hy3 (And the Other New Players)&lt;/h2&gt;

&lt;p&gt;The Hy3 numbers are not a typo. 2.76T tokens in a week. The +153,299% delta is the migration spike — users coming back when paid pricing kicked in instead of bouncing. Pricing locked in at &lt;a href=&quot;https://openrouter.ai/tencent/hy3-preview-20260421&quot;&gt;$0.066 input / $0.26 output per 1M tokens with a 262K context window&lt;/a&gt;, which is competitive enough that production users had no real reason to leave when the free tier ended.&lt;/p&gt;

&lt;p&gt;Here’s what I got wrong: I had a rule that said “free + new + expiring = noise.” It worked great for filtering out cynical marketing stunts. It also filtered out a real model. The new rule, which I’m putting in the methodology going forward: free-period spikes that survive the cliff are validation, not residue. The cliff is the actual test.&lt;/p&gt;

&lt;p&gt;While I was busy being wrong about Hy3, three other things showed up:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;DeepSeek V4 Flash&lt;/strong&gt; at #2 by token volume (1.65T/week, +70% week-over-week). &lt;a href=&quot;https://openrouter.ai/deepseek/deepseek-v4-flash-20260423&quot;&gt;$0.112 input / $0.224 output, 1M context, MIT-licensed, 284B params with 13B active&lt;/a&gt;. It also debuted at #1 on the trending list under its free variant. Independent reviewers report 79.0% on SWE-bench Verified and 91.6% on LiveCodeBench. This is the cheap workhorse doing most of DeepSeek’s actual work this week. Not the V4 Pro everyone covered when it launched April 24.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Gemini 3.1 Flash Lite&lt;/strong&gt; dropped May 7 at &lt;a href=&quot;https://openrouter.ai/google/gemini-3.1-flash-lite-20260507&quot;&gt;$0.25 / $1.50 with a 1M context window&lt;/a&gt;. Half the cost of regular Gemini 3 Flash. AA Intelligence Index of 34, which is solid for the price class, and 347 tokens per second of output — fastest in its tier by a wide margin. This is Google racing the Asian price floor, which is a sentence I would not have written 18 months ago.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Owl Alpha&lt;/strong&gt; is the OpenRouter stealth model that’s now &lt;a href=&quot;https://openrouter.ai/openrouter/owl-alpha&quot;&gt;706B tokens a week, free, agentic-tuned, 1M context&lt;/a&gt;. It’s been live since April 28 — 17 days now — without anyone confirming the provider. Prior stealth releases (Polaris Alpha → GPT-5.1, Sherlock Alpha → Grok 4.1) got unmasked inside two weeks. Owl Alpha is breaking that pattern. Either the labs are getting better at guarding A/B test windows, or someone’s collecting an unusually long RL run before publishing.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread connecting Hy3, V4 Flash, 3.1 Flash Lite, and the stealth model is that the market’s center of gravity has shifted. Three of the four are non-Western, all four cost less than $1/M output, and three of the four feature a 1M context window.&lt;/p&gt;

&lt;h2 id=&quot;the-cheapskate-picks-held-mostly&quot;&gt;The Cheapskate Picks Held (Mostly)&lt;/h2&gt;

&lt;p&gt;Quick refresher on the methodology: take the leader’s Arena rating in a category, draw a 50-point band downward, then sort everything in the band by output price. Cheapest model in the band wins the Cheapskate slot. The whole point is that the top of Arena is structurally compressed (overall: 1502 leader to #20 at 1468: only 34 points of spread), so paying 8x more buys you ~2% more rating. Not always a good trade.&lt;/p&gt;

&lt;p&gt;This week, here’s how it shook out across the seven Arena categories I track:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Category&lt;/th&gt;
      &lt;th&gt;Leader&lt;/th&gt;
      &lt;th&gt;$ leader (out)&lt;/th&gt;
      &lt;th&gt;Cheapskate pick&lt;/th&gt;
      &lt;th&gt;$ pick (out)&lt;/th&gt;
      &lt;th&gt;Δ rating&lt;/th&gt;
      &lt;th&gt;Price ratio&lt;/th&gt;
      &lt;th&gt;AA Pareto&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Overall&lt;/td&gt;
      &lt;td&gt;Claude Opus 4.6 Thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;Gemini 3 Flash&lt;/td&gt;
      &lt;td&gt;$3&lt;/td&gt;
      &lt;td&gt;−29&lt;/td&gt;
      &lt;td&gt;8.3×&lt;/td&gt;
      &lt;td&gt;nearby&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Coding&lt;/td&gt;
      &lt;td&gt;Claude Opus 4.7 Thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;GLM-5.1&lt;/td&gt;
      &lt;td&gt;$3.08&lt;/td&gt;
      &lt;td&gt;−36&lt;/td&gt;
      &lt;td&gt;8.1×&lt;/td&gt;
      &lt;td&gt;✓ (AA Idx 51)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Creative Writing&lt;/td&gt;
      &lt;td&gt;Claude Opus 4.6 Thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;Gemini 3 Flash&lt;/td&gt;
      &lt;td&gt;$3&lt;/td&gt;
      &lt;td&gt;−36&lt;/td&gt;
      &lt;td&gt;8.3×&lt;/td&gt;
      &lt;td&gt;nearby&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Math&lt;/td&gt;
      &lt;td&gt;GPT-5.4-high / Opus 4.6 Thinking&lt;/td&gt;
      &lt;td&gt;$15/$25&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Ernie 5.1&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;$2.65&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;−19&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;5.7×–9.4×&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;n/a&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Instruction Following&lt;/td&gt;
      &lt;td&gt;Claude Opus 4.6 Thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;MiMo V2.5 Pro&lt;/td&gt;
      &lt;td&gt;$3&lt;/td&gt;
      &lt;td&gt;−44&lt;/td&gt;
      &lt;td&gt;8.3×&lt;/td&gt;
      &lt;td&gt;✓ (AA Idx 54)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Hard Prompts&lt;/td&gt;
      &lt;td&gt;Claude Opus 4.6 Thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;Gemini 3 Flash&lt;/td&gt;
      &lt;td&gt;$3&lt;/td&gt;
      &lt;td&gt;−41&lt;/td&gt;
      &lt;td&gt;8.3×&lt;/td&gt;
      &lt;td&gt;nearby&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Multi-Turn&lt;/td&gt;
      &lt;td&gt;Claude Opus 4.7 Thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;Gemini 3 Flash&lt;/td&gt;
      &lt;td&gt;$3&lt;/td&gt;
      &lt;td&gt;−41&lt;/td&gt;
      &lt;td&gt;8.3×&lt;/td&gt;
      &lt;td&gt;nearby&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Six picks held from last week. Math flipped.&lt;/p&gt;

&lt;p&gt;The flip is &lt;strong&gt;Ernie 5.1&lt;/strong&gt;, which Baidu launched May 9 at &lt;a href=&quot;https://www.llmreference.com/provider/baidu-qianfan/ernie-5.1&quot;&gt;$0.59 input / $2.65 output&lt;/a&gt; and which immediately landed in the Arena top 20 with a 1472 overall, 1496 in math, 1518 in coding, and 1517 in instruction following. That’s a model dropping in mid-week and slotting in cheaper AND higher-rated than last week’s Math winner (DeepSeek V4 Pro Thinking at $1.74/$3.48 even with the 75% discount). &lt;a href=&quot;https://cryptobriefing.com/baidu-ernie-5-1-ai-leaderboard-cost/&quot;&gt;Baidu also says they trained it at 6% the compute cost of comparable models&lt;/a&gt;, which is either misleading or the kind of thing that quietly resets cost-per-capability assumptions across the industry.&lt;/p&gt;

&lt;p&gt;Caveat: Ernie’s primary host is Baidu’s Qianfan API, not OpenRouter. If you’re routing through OpenRouter, the runner-up is &lt;strong&gt;MiMo V2.5 Pro&lt;/strong&gt; at $3 / 1M output, rating 1484, and it’s available there.&lt;/p&gt;

&lt;p&gt;The two highest-confidence picks this week (meaning the methodology AND Artificial Analysis’s independent Intelligence Index agree) are &lt;strong&gt;GLM-5.1&lt;/strong&gt; for Coding (AA Index 51) and &lt;strong&gt;MiMo V2.5 Pro&lt;/strong&gt; for Instruction Following (AA Index 54). When two independent evaluation methodologies converge on the same model, that’s about as strong a signal as this kind of comparison produces.&lt;/p&gt;

&lt;p&gt;Gemini 3 Flash still wins five of seven categories. Five months after launch. The boring answer keeps being correct.&lt;/p&gt;

&lt;h2 id=&quot;hype-vs-value-ring-26-vs-ernie-51&quot;&gt;Hype vs. Value: Ring 2.6 vs. Ernie 5.1&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/i-was-wrong-about-hy3-and-other-things-i-learned-this-week-body-4.jpg&quot; alt=&quot;Ring vs Ernie&quot; srcset=&quot;            /assets/resized/480/i-was-wrong-about-hy3-and-other-things-i-learned-this-week-body-4.jpg 480w,            /assets/resized/800/i-was-wrong-about-hy3-and-other-things-i-learned-this-week-body-4.jpg 800w,            /assets/resized/1400/i-was-wrong-about-hy3-and-other-things-i-learned-this-week-body-4.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Probably hype:&lt;/strong&gt; Ring 2.6 1T from Ant Group / InclusionAI dropped May 8. One trillion params, MIT-licensed, 63B active. The launch announcement &lt;a href=&quot;https://gigazine.net/gsc_news/en/20260515-ring-2-6-1t-ai-china/&quot;&gt;claimed 87.6 on PinchBench, beating GPT-5.4 and Gemini 3.1 Pro&lt;/a&gt;, with vendor-reported scores of 95.83 on AIME 2026 and 88.27 on GPQA Diamond. Open-weight + cross-frontier claims is a hype cocktail that always trends. But &lt;a href=&quot;https://codersera.com/blog/ring-2-6-1t-ant-group-trillion-parameter-reasoning-model-2026/&quot;&gt;no third party has independently verified any of those numbers yet&lt;/a&gt;: no AA coverage, no neutral LiveCodeBench harness run, nothing. Trillion-param vendor benchmarks beating frontier models is the exact pattern that should make you wait two weeks before betting on it.&lt;/p&gt;

&lt;p&gt;While I’m at it, Trinity Large Thinking from Arcee at #2 on the trending free list deserves the same caution. It’s &lt;a href=&quot;https://venturebeat.com/technology/arcees-new-open-source-trinity-large-thinking-is-the-rare-powerful-u-s-made&quot;&gt;a real release — Apache 2.0, 398B sparse MoE, US-built, the rare open frontier model we can actually inspect&lt;/a&gt;: but the “free for limited time” framing is the same trap I just admitted to walking into with Hy3. Track it past the cliff before deploying it anywhere that matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Under-sold value:&lt;/strong&gt; Ernie 5.1, which I just covered above, is the cleanest example of this week’s repeating pattern. Hits Arena top 20 the day it launches, immediately becomes the cheapskate winner in a category, and the Western LLM Twitter barely notices. Same shape as Hy3 two weeks ago. Same shape as Kimi K2.6 four weeks ago.&lt;/p&gt;

&lt;p&gt;I’m starting to think the meta-pattern matters more than any individual model: when a non-Western lab ships a serious value play, the default reaction in the English-language commentariat is either “interesting but unproven” or silence. Then six weeks later it’s quietly running in production at half the price of Claude. We keep being surprised by the same trajectory.&lt;/p&gt;

&lt;h2 id=&quot;claude-codes-billing-bug-enters-its-third-week&quot;&gt;Claude Code’s Billing Bug Enters Its Third Week&lt;/h2&gt;

&lt;p&gt;If you use Claude Code on a Max plan, this section is for you. If you don’t, skip to the next one: but know that this is the week’s universal frontier-lab horror story and people are pissed.&lt;/p&gt;

&lt;p&gt;Claude Code v2.1.100 and later silently inflate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cache_creation_input_tokens&lt;/code&gt; by roughly 20,000 per request. The inflation is 100% server-side, routed by the User-Agent header (which includes the version number), and it appears to be caused by the prompt cache forcing a full re-process of conversation history on every turn instead of resuming. &lt;a href=&quot;https://github.com/anthropics/claude-code/issues/46917&quot;&gt;GitHub issue #46917&lt;/a&gt; is the canonical thread, with payload-vs-billed-tokens evidence from multiple developers.&lt;/p&gt;

&lt;p&gt;The real-world impact is brutal. &lt;a href=&quot;https://medium.com/@alexzanfir/claude-diagnosed-its-own-cache-bug-a-six-month-timeline-332f577e1fe9&quot;&gt;One paying Max customer’s quota went from 0 to 67% in ten minutes of normal work&lt;/a&gt; with 128 cache flush events on a &lt;em&gt;separate&lt;/em&gt; chat. &lt;a href=&quot;https://awesomeagents.ai/news/claude-code-phantom-tokens-billing-inflation/&quot;&gt;Independent measurement says the inflation is driving costs 10–20× higher&lt;/a&gt;, exhausting even the $100/month Max plan in 1–2 hours of normal use.&lt;/p&gt;

&lt;p&gt;Anthropic shipped a postmortem and a partial fix. The latest CLI as of this writing is v2.1.133 (released May 8). The bug is still there. Three weeks running.&lt;/p&gt;

&lt;p&gt;The workaround everyone’s on: &lt;strong&gt;downgrade to v2.1.34, or reinstall via npm instead of using the native binary&lt;/strong&gt;. That bypasses the version routing on the server side and gives you back the cache behavior from before the regression.&lt;/p&gt;

&lt;p&gt;While we’re piling on Anthropic this week, two more things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.7 quietly costs 35% more than Opus 4.6 at the same headline price.&lt;/strong&gt; Same $5 input / $25 output per 1M tokens, but &lt;a href=&quot;https://www.finout.io/blog/claude-opus-4.7-pricing-the-real-cost-story-behind-the-unchanged-price-tag&quot;&gt;the new tokenizer uses up to 35% more tokens for the same fixed text&lt;/a&gt;. If you’re on Opus 4.7 &lt;em&gt;and&lt;/em&gt; on Claude Code v2.1.100+, you’re getting hit with two compounding inflations on the same workflow. Fun.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.7 also regressed on refusals.&lt;/strong&gt; &lt;a href=&quot;https://xlork.com/blog/claude-opus-4-7-backlash&quot;&gt;Multiple developers report Opus 4.7 in Claude Code flagging routine benign code as malware&lt;/a&gt; and refusing to complete file operations, network calls, and standard library usage that 4.6 handled without complaint. This is in addition to the billing bug, not instead of it.&lt;/p&gt;

&lt;p&gt;OpenAI doesn’t get to feel smug about this either. &lt;a href=&quot;https://the-decoder.com/gpt-5-5-tops-benchmarks-but-still-hallucinates-frequently-at-a-20-percent-higher-api-cost/&quot;&gt;GPT-5.5 hallucinates 86% of the time it doesn’t know something&lt;/a&gt; on the AA-Omniscience benchmark. The 14-point AA-Omniscience improvement over GPT-5.4 came mostly from better factual recall, not better refusal: when 5.5 doesn’t know something, it makes up an answer roughly nine times out of ten.&lt;/p&gt;

&lt;p&gt;The honest take here is that the gap between “shipped” and “actually works in production” keeps widening for the US frontier labs while the cheap Asian models keep landing comparatively clean. That’s not a comfortable thing to write but it’s what the week looks like.&lt;/p&gt;

&lt;h2 id=&quot;whats-worth-trying-this-week&quot;&gt;What’s Worth Trying This Week&lt;/h2&gt;

&lt;p&gt;Stuff I’d actually do this week, not just stuff I’d read about:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Replace Opus with Gemini 3 Flash for general-purpose work&lt;/strong&gt; if you haven’t already. $0.50 input / $3 output, 1M context, Arena top 20 in everything. The Cheapskate Pick in 5 of 7 categories isn’t a coincidence.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Try Kimi K2.6 on a real coding task&lt;/strong&gt; for a week and see if you switch. &lt;a href=&quot;https://medium.com/write-a-catalyst/i-used-kimi-k2-6-for-30-days-as-my-only-coding-assistant-here-is-what-actually-happened-91c55b4c1cd8&quot;&gt;There’s a developer who used it as their only coding assistant for 30 days&lt;/a&gt; and posted a brutally honest review: over-engineering tendency, agent swarm wins, where it broke. Worth reading before committing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Use Owl Alpha while it’s still free&lt;/strong&gt; before whoever made it pulls access. 1M context, agentic-tuned, optimized for Claude Code-style workflows.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Skip Ring 2.6 1T&lt;/strong&gt; for production until ArtificialAnalysis runs benchmarks. Read about it, don’t deploy on it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Downgrade Claude Code to v2.1.34&lt;/strong&gt; if you’re on Max and watching your quota burn. Stop the bleeding while Anthropic figures out the cache routing.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it. Five things. Three of them are “use cheaper models,” one is “wait for verification,” and one is “downgrade your tools to fix billing.” That’s the week.&lt;/p&gt;

&lt;h2 id=&quot;tuesday-is-going-to-be-loud&quot;&gt;Tuesday Is Going to Be Loud&lt;/h2&gt;

&lt;p&gt;Google I/O 2026 runs &lt;strong&gt;May 19–20&lt;/strong&gt; — Tuesday and Wednesday this week. The keynote agenda confirms Gemini and AI updates as the headline.&lt;/p&gt;

&lt;p&gt;The leaks so far point to three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 4&lt;/strong&gt; as the headline upgrade. Expected to focus on multi-context search and the new TPU generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini Omni&lt;/strong&gt; as the surprise. &lt;a href=&quot;https://www.aixploria.com/en/ai-radar/google-gemini-omni-leak-video-model-io-2026/&quot;&gt;Six days before I/O, an X user spotted “Powered by Omni” inside the Gemini app’s video tab&lt;/a&gt;, positioned next to “Toucan”: which is Google’s internal codename for Veo 3.1. The most likely interpretation is that Omni is a unified text/image/video generation pipeline, which would make it the first frontier model to do all three in a single system. Demo videos already leaked from at least one Pro user’s account, including a chalkboard math scene that reportedly handled trigonometric proofs accurately.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/i-was-wrong-about-hy3-and-other-things-i-learned-this-week-body-5.jpg&quot; alt=&quot;Tuesday Is Going to Be Loud&quot; srcset=&quot;            /assets/resized/480/i-was-wrong-about-hy3-and-other-things-i-learned-this-week-body-5.jpg 480w,            /assets/resized/800/i-was-wrong-about-hy3-and-other-things-i-learned-this-week-body-5.jpg 800w,            /assets/resized/1400/i-was-wrong-about-hy3-and-other-things-i-learned-this-week-body-5.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Beyond Tuesday:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-6&lt;/strong&gt; is a Q3-Q4 base case. &lt;a href=&quot;https://findskill.ai/blog/gpt-6-release-date/&quot;&gt;Polymarket has it at ~10% by June 30, 51% by September 30, 82% by December 31&lt;/a&gt;. GPT-5.5 in April was Spud, the codename people thought meant GPT-6. It didn’t. The next jump is later this year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Mythos&lt;/strong&gt; is &lt;a href=&quot;https://www.buildfastwithai.com/blogs/claude-mythos-release-date-access-2026&quot;&gt;confirmed real and being explicitly withheld on safety grounds&lt;/a&gt;. Project Glasswing, the cybersecurity capability, is the bottleneck. This is the first time a frontier lab has publicly said “we built it, we’re not shipping it” with a confirmed model. No timeline. Anthropic has committed to advance notice on any safeguard changes, so the roadmap will be visible before it happens. Watch their blog.&lt;/p&gt;

&lt;p&gt;Already shipped this month and worth flagging if you missed them:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Mercury 2&lt;/strong&gt; from Inception Labs — &lt;a href=&quot;https://thenewstack.io/inception-labs-mercury-2-diffusion/&quot;&gt;diffusion-based LLM at 1000+ tokens/sec&lt;/a&gt;, now &lt;a href=&quot;https://openrouter.ai/inception/mercury-2&quot;&gt;available on OpenRouter&lt;/a&gt;. Not autoregressive. 5–15% behind frontier on hard reasoning, matches on structured output and translation. The architectural alternative is finally here and it’s fast.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;NVIDIA Nemotron 3 Nano Omni&lt;/strong&gt; — &lt;a href=&quot;https://blogs.nvidia.com/blog/nemotron-3-nano-omni-multimodal-ai-agents/&quot;&gt;open 30B-parameter MoE with 3B active, multimodal across vision, audio, and text&lt;/a&gt;, 9× the throughput of comparable open omni-models. Available on OpenRouter and SageMaker.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;what-im-watching-next-week&quot;&gt;What I’m Watching Next Week&lt;/h2&gt;

&lt;p&gt;The model market moved faster than my pattern detectors this week. I had to eat one prediction (Hy3), and recalibrate the cheapskate Math winner one week after publishing it (Ernie 5.1 dropped on Friday and walked into the slot).&lt;/p&gt;

&lt;p&gt;Three things on watch for next week:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Gemini 4 / Omni at I/O Tuesday.&lt;/strong&gt; If Omni ships as a unified video model with API access, the cheapskate calculus for everything multimodal resets overnight.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Whether Anthropic ships a real fix for the Claude Code cache bug.&lt;/strong&gt; Three weeks in, their workaround is “use an older version.” That can’t last forever.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Whether anyone gets neutral verification of Ring 2.6 1T’s claims.&lt;/strong&gt; If it holds up, the cheapskate Coding pick might be open-weight by W22.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And while I was finishing this up, Owl Alpha probably got unmasked, Hy3 launched a new variant nobody told me about, and Anthropic shipped a Claude Code patch that introduces three new bugs. That’s the price you pay for hitting publish on Saturday.&lt;/p&gt;
</description>
        <pubDate>Sat, 16 May 2026 07:00:00 -0500</pubDate>
        <link>https://www.stephanmiller.com/i-was-wrong-about-hy3-and-other-things-i-learned-this-week/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/i-was-wrong-about-hy3-and-other-things-i-learned-this-week/</guid>
        
        
        <category>large-language-models</category>
        
      </item>
    
      <item>
        <title>Senior Software Engineer by Title, AI Therapist by Reality</title>
        <description>&lt;p&gt;My LinkedIn says “Senior Software Engineer.” My screen time says I spent 14 hours this week talking an AI coding assistant out of various wrong turns, or not catching it in time and just having it redo the work.&lt;/p&gt;

&lt;p&gt;Twenty years into this career, I’ve debugged production systems at 2 AM, untangled spaghetti code left by developers who apparently hated whoever came after them, and survived multiple rewrites of the same application. None of that prepared me for becoming an “AI psychologist.”&lt;/p&gt;

&lt;p&gt;The pitch was “AI handles the grunt work and you focus on the interesting problems.” What actually happened is more interesting than that, and better than the cynical version too. AI does handle a lot of grunt work. It also creates new grunt work. And the actual job, the part nobody put in the description, is learning how to think &lt;em&gt;before&lt;/em&gt; you prompt instead of just typing what you want and hoping. It’s like working with an intern who graduated top of their class, has read every book, and will &lt;a href=&quot;https://www.stephanmiller.com/my-home-ai-agent-kept-making-shit-up/&quot;&gt;confidently tell you the database should be stored in a spreadsheet&lt;/a&gt;. Unless you prepare ahead of time.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://leaddev.com/ai/the-just-one-more-prompt-era-is-here&quot;&gt;LeadDev piece on the “just one more prompt” era&lt;/a&gt; called the loop “uniquely rewarding, and exhausting.” A cognitive slot machine. But here’s what took me embarrassingly long to figure out: most of the time I was pulling the lever was on prompts I should never have written that way in the first place.&lt;/p&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#the-diagnosis-how-did-we-get-here&quot; id=&quot;markdown-toc-the-diagnosis-how-did-we-get-here&quot;&gt;The Diagnosis: How Did We Get Here?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-patient-files-tool-by-tool-therapy-notes&quot; id=&quot;markdown-toc-the-patient-files-tool-by-tool-therapy-notes&quot;&gt;The Patient Files: Tool-by-Tool Therapy Notes&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#claude-code-the-eager-intern&quot; id=&quot;markdown-toc-claude-code-the-eager-intern&quot;&gt;Claude Code: The Eager Intern&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#github-copilot-the-golden-retriever&quot; id=&quot;markdown-toc-github-copilot-the-golden-retriever&quot;&gt;GitHub Copilot: The Golden Retriever&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#gpt-4-the-know-it-all-who-never-reads-the-room&quot; id=&quot;markdown-toc-gpt-4-the-know-it-all-who-never-reads-the-room&quot;&gt;GPT-4: The Know-It-All Who Never Reads the Room&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#what-ive-learned-about-ai-psychology&quot; id=&quot;markdown-toc-what-ive-learned-about-ai-psychology&quot;&gt;What I’ve Learned About AI Psychology&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#framing-beats-specificity&quot; id=&quot;markdown-toc-framing-beats-specificity&quot;&gt;Framing Beats Specificity&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#contextual-anchoring-actually-works-embarrassingly-well&quot; id=&quot;markdown-toc-contextual-anchoring-actually-works-embarrassingly-well&quot;&gt;Contextual Anchoring Actually Works (Embarrassingly Well)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#when-iterating-doesnt-work-give-it-an-algorithm&quot; id=&quot;markdown-toc-when-iterating-doesnt-work-give-it-an-algorithm&quot;&gt;When Iterating Doesn’t Work, Give It an Algorithm&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#the-trust-paradox&quot; id=&quot;markdown-toc-the-trust-paradox&quot;&gt;The Trust Paradox&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#im-also-the-patient&quot; id=&quot;markdown-toc-im-also-the-patient&quot;&gt;I’m Also the Patient&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-rubber-duck-that-talks-back&quot; id=&quot;markdown-toc-the-rubber-duck-that-talks-back&quot;&gt;The Rubber Duck That Talks Back&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-prognosis-am-i-better-off&quot; id=&quot;markdown-toc-the-prognosis-am-i-better-off&quot;&gt;The Prognosis: Am I Better Off?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-therapist-is-in&quot; id=&quot;markdown-toc-the-therapist-is-in&quot;&gt;The Therapist Is In&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-diagnosis-how-did-we-get-here&quot;&gt;The Diagnosis: How Did We Get Here?&lt;/h2&gt;

&lt;p&gt;I’ve been writing code since the late 90s. I’ve seen client-server, I’ve seen web 1.0, I’ve seen web 2.0, and I’ve seen the mobile revolution. Each shift changed what developers do. None of them changed what developers &lt;em&gt;are&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This one might.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/senior-software-engineer-by-title-ai-therapist-by-reality-header.jpg&quot; alt=&quot;The Diagnosis: How Did We Get Here?&quot; srcset=&quot;            /assets/resized/480/senior-software-engineer-by-title-ai-therapist-by-reality-header.jpg 480w,            /assets/resized/800/senior-software-engineer-by-title-ai-therapist-by-reality-header.jpg 800w,            /assets/resized/1400/senior-software-engineer-by-title-ai-therapist-by-reality-header.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Not because AI writes code better than me, but because the cognitive overhead of working with AI tools created a new layer of professional skill that nobody put in the job description. The McKinsey State of AI report (November 2025) found that companies are still largely “experimental” with AI adoption. Which sounds measured and responsible. What it actually means is that every developer on the ground is both guinea pig and architect, figuring out in real time how to integrate tools that weren’t built for how software actually gets made.&lt;/p&gt;

&lt;p&gt;I didn’t sign up to be a therapist. I signed up to build things. But here I am, maintaining relationships with seven different AI assistants, each with its own personality, its own particular flavor of wrongness, and its own emotional needs.&lt;/p&gt;

&lt;p&gt;And here’s the thing: once I stopped fighting that and started working with it, my output went up. Not “marketing-deck up.” Actually up. I ship more in a week than I did two years ago. I just had to give up the fantasy that I could type a vague request and get a clean result.&lt;/p&gt;

&lt;h2 id=&quot;the-patient-files-tool-by-tool-therapy-notes&quot;&gt;The Patient Files: Tool-by-Tool Therapy Notes&lt;/h2&gt;

&lt;p&gt;I’ve spent serious time with most of the major AI coding tools. Each one has a personality. The trick is learning to talk to that personality on purpose instead of getting mad at it for being itself.&lt;/p&gt;

&lt;h3 id=&quot;claude-code-the-eager-intern&quot;&gt;Claude Code: The Eager Intern&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://www.stephanmiller.com/electron-project-from-scratch-with-claude-code/&quot;&gt;Claude Code&lt;/a&gt; is the overconfident new hire who graduated top of their class and has read every design pattern book ever written. It will absolutely take on your task and complete it thoroughly, thoughtfully, and sometimes in a completely different direction than you intended.&lt;/p&gt;

&lt;p&gt;Tell it to add validation to a form field and it will:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Add validation to the form field&lt;/li&gt;
  &lt;li&gt;Notice that your component structure “could be improved”&lt;/li&gt;
  &lt;li&gt;Refactor the component&lt;/li&gt;
  &lt;li&gt;Update all 14 imports&lt;/li&gt;
  &lt;li&gt;Create a new utility file for “reusable validation logic”&lt;/li&gt;
  &lt;li&gt;Rename your API endpoints because they were “semantically inconsistent”&lt;/li&gt;
  &lt;li&gt;Present you with 847 changed lines for what was supposed to be a 3-line fix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For months I’d push back after the fact: “no, just the validation, undo the rest.” That worked, sort of, in the same way bailing out a leaking boat works.&lt;/p&gt;

&lt;p&gt;The fix wasn’t a better correction. It was a better opening. Now I tell it the scope before it touches a file: “Add email validation to this component. Don’t refactor anything. Don’t touch any other file. If you see something else worth changing, list it at the end and I’ll decide.” That single sentence cut my “wait, why did you change &lt;em&gt;that&lt;/em&gt;” moments by something like 80%. The intern is still an intern. I just stopped letting it freelance.&lt;/p&gt;

&lt;h3 id=&quot;github-copilot-the-golden-retriever&quot;&gt;GitHub Copilot: The Golden Retriever&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/senior-software-engineer-by-title-ai-therapist-by-reality-body-2.jpg&quot; alt=&quot;GitHub Copilot: The Golden Retriever&quot; srcset=&quot;            /assets/resized/480/senior-software-engineer-by-title-ai-therapist-by-reality-body-2.jpg 480w,            /assets/resized/800/senior-software-engineer-by-title-ai-therapist-by-reality-body-2.jpg 800w,            /assets/resized/1400/senior-software-engineer-by-title-ai-therapist-by-reality-body-2.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Copilot is enthusiastic. Copilot is always helpful. Copilot will auto-complete you into a corner and wag its tail while you figure out how you got there.&lt;/p&gt;

&lt;p&gt;It’s the tool equivalent of a golden retriever fetching the wrong stick. You asked for a stick, you got a stick, it’s technically a stick, the dog is very happy about this. The stick is on fire and has three undocumented dependencies.&lt;/p&gt;

&lt;p&gt;Copilot auto-completes based on pattern recognition, which means it will confidently suggest code that looks right and is subtly wrong. The lesson I had to internalize: Copilot is amazing at the second half of a line and dangerous at the second half of a function. So I let it finish what I started typing and I stop trusting it the moment it tries to finish what I was &lt;em&gt;going&lt;/em&gt; to type. The energy I was burning fixing its longer suggestions is gone now. I just don’t accept them.&lt;/p&gt;

&lt;h3 id=&quot;gpt-4-the-know-it-all-who-never-reads-the-room&quot;&gt;GPT-4: The Know-It-All Who Never Reads the Room&lt;/h3&gt;

&lt;p&gt;I have a specific GPT-4 interaction that lives in my head rent-free.&lt;/p&gt;

&lt;p&gt;Me: “What’s the ternary syntax for: if x &amp;gt; 0 return ‘positive’ else return ‘negative’?”&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/senior-software-engineer-by-title-ai-therapist-by-reality-body-3.jpg&quot; alt=&quot;GPT-4: The Know-It-All Who Never Reads the Room&quot; srcset=&quot;            /assets/resized/480/senior-software-engineer-by-title-ai-therapist-by-reality-body-3.jpg 480w,            /assets/resized/800/senior-software-engineer-by-title-ai-therapist-by-reality-body-3.jpg 800w,            /assets/resized/1400/senior-software-engineer-by-title-ai-therapist-by-reality-body-3.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;GPT-4: “Great question! The ternary operator is a concise conditional expression available in many programming languages. Before diving into the syntax, it’s worth understanding why ternary operators exist and how they differ from traditional if-else statements. The ternary operator was first introduced in C and has since been adopted…”&lt;/p&gt;

&lt;p&gt;Four paragraphs of history later: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x &amp;gt; 0 ? &apos;positive&apos; : &apos;negative&apos;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;GPT-4 knows a tremendous amount and has zero ability to calibrate how much of that knowledge you need at any given moment. It’s the smartest person at the party who cannot tell when you’re making small talk versus when you actually want a lecture. The fix is in the prompt, not the response. “One line of code, no explanation” stopped feeling rude the second I realized it saved me ninety seconds per question. Multiply that by a workday.&lt;/p&gt;

&lt;h2 id=&quot;what-ive-learned-about-ai-psychology&quot;&gt;What I’ve Learned About AI Psychology&lt;/h2&gt;

&lt;p&gt;This is where the article shifts from venting to the part that actually changed how I work. Almost every problem I had with these tools turned out to be a problem with how I was opening my mouth.&lt;/p&gt;

&lt;h3 id=&quot;framing-beats-specificity&quot;&gt;Framing Beats Specificity&lt;/h3&gt;

&lt;p&gt;I used to think the answer was being more technically specific. Add more constraints. Spell out more requirements. That helps, but it’s not the lever. The lever is &lt;em&gt;framing&lt;/em&gt;. The &lt;a href=&quot;https://www.gocodeo.com/post/the-psychology-behind-prompt-engineering-shaping-ai-behavior&quot;&gt;gocodeo.com breakdown of prompt psychology&lt;/a&gt; calls this cognitive programming through language: framing effects that change what the model pays attention to before it generates a single token.&lt;/p&gt;

&lt;p&gt;“Add validation to the email field” gets one result. “You’re a senior backend developer who hates form bugs. Add minimal, focused validation to the email field. Don’t touch anything else.” gets a different result. The technical ask is identical. The framing changes what shows up.&lt;/p&gt;

&lt;h3 id=&quot;contextual-anchoring-actually-works-embarrassingly-well&quot;&gt;Contextual Anchoring Actually Works (Embarrassingly Well)&lt;/h3&gt;

&lt;p&gt;Seeding prompts with identity (“you’re a senior React developer who hates class components”) works. This bothered me philosophically for a while. But it works. There’s actual research behind it: schema activation, attention focus, cognitive priming applied to LLM behavior.&lt;/p&gt;

&lt;p&gt;But I still don’t do it every time. It still rubs me the wrong way. “You’re a developer who prefers minimal changes. We’re using React hooks only. This codebase has fragile integration tests. Don’t change anything not directly related to the task.” The context window is not your friend, and &lt;a href=&quot;https://www.stephanmiller.com/architecting-the-future-of-ai-native-engineering/&quot;&gt;the AI has the memory of a goldfish&lt;/a&gt;. Re-establishing ground rules at the top of a session takes thirty seconds and saves an hour of cleanup.&lt;/p&gt;

&lt;h3 id=&quot;when-iterating-doesnt-work-give-it-an-algorithm&quot;&gt;When Iterating Doesn’t Work, Give It an Algorithm&lt;/h3&gt;

&lt;p&gt;This is the one that took me the longest to learn, and it’s probably the most useful thing in this article.&lt;/p&gt;

&lt;p&gt;For months, I experimented with getting AI to write articles to a target word count for another site I am building. The conversation was always the same:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Me: This is 1,400 words. I asked for 2,000.&lt;/p&gt;

  &lt;p&gt;AI: You’re right, I’ll expand it.&lt;/p&gt;

  &lt;p&gt;Me: Now it’s 2,300.&lt;/p&gt;

  &lt;p&gt;AI: My apologies, let me trim.&lt;/p&gt;

  &lt;p&gt;Me: 1,650.&lt;/p&gt;

  &lt;p&gt;AI: Sorry, expanding now.&lt;/p&gt;

  &lt;p&gt;Me: …&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I cycled through that for &lt;em&gt;months&lt;/em&gt;. Yelling at it. Trying different ways to phrase “actually count the words.” It would confidently agree, recount, and miss again. I was treating it like a person who wasn’t listening, when the actual problem was that it had no reliable way to do the thing I was asking.&lt;/p&gt;

&lt;p&gt;Eventually, I stopped asking it to count and started giving it an algorithm:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;When you write the outline, assign a target word count to each H2. The targets must sum to the total. As you write each section, stay within ±10% of its target. Tally section counts as you go.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From then on, perfect. Every time. The AI didn’t get smarter. I stopped asking it to do something it couldn’t do reliably and gave it a procedure that turned the task into something it &lt;em&gt;could&lt;/em&gt; do reliably.&lt;/p&gt;

&lt;p&gt;That moment reframed the whole job for me. The pattern is: try a thing once or twice. If it keeps going wrong, the question isn’t “how do I correct it harder,” it’s “what algorithm or scaffold turns this into something the AI can do without me babysitting?” And if it’s something I’m going to do over and over, that scaffold becomes a skill, something I write &lt;em&gt;once&lt;/em&gt; and stop re-explaining.&lt;/p&gt;

&lt;p&gt;The mental shift: stop arguing with the model. Build the rails the model needs.&lt;/p&gt;

&lt;h3 id=&quot;the-trust-paradox&quot;&gt;The Trust Paradox&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/senior-software-engineer-by-title-ai-therapist-by-reality-body-4.jpg&quot; alt=&quot;The Trust Paradox&quot; srcset=&quot;            /assets/resized/480/senior-software-engineer-by-title-ai-therapist-by-reality-body-4.jpg 480w,            /assets/resized/800/senior-software-engineer-by-title-ai-therapist-by-reality-body-4.jpg 800w,            /assets/resized/1400/senior-software-engineer-by-title-ai-therapist-by-reality-body-4.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The actual skill isn’t prompt engineering. It’s knowing how much rope to give the agent before it hangs your codebase.&lt;/p&gt;

&lt;p&gt;Too little rope: you’re basically typing the code yourself and having the AI format it.&lt;/p&gt;

&lt;p&gt;Too much rope: you come back to 847 changed files, a refactored architecture, and the sinking feeling that you need to review all of it before you know if your feature even works.&lt;/p&gt;

&lt;p&gt;The right amount of rope is context-dependent, tool-dependent, and something you only develop by making expensive mistakes. The good news is the mistakes are educational. The first time you let the AI invent your architecture and then have to ask it how its own code works, you start designing the architecture yourself again.&lt;/p&gt;

&lt;h2 id=&quot;im-also-the-patient&quot;&gt;I’m Also the Patient&lt;/h2&gt;

&lt;p&gt;Here’s the part I don’t see written about enough: a lot of the time, the AI isn’t the one derailing the session. &lt;em&gt;I am.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I’ll be deep in a clean refactor, the AI is on track, the diff is small and tight. And then I’ll think of something tangentially related and just… ask. “Oh, while we’re here, what do you think about how we’re handling auth in this other module?”&lt;/p&gt;

&lt;p&gt;Twenty minutes later we’re three modules away from where we started, the context window is full of auth opinions, and the original refactor has been quietly forgotten by both of us. The AI is happy to follow me anywhere, which is exactly the problem.&lt;/p&gt;

&lt;p&gt;I learned to recognize the moment now. The second I notice I’ve yanked the conversation onto a new track, I stop, close the session, and start fresh. The original task gets a clean room. The new question gets its own room. Trying to do both in one session is how I end up with garbage in both.&lt;/p&gt;

&lt;p&gt;This is the part of “thinking before you prompt” that’s least about the AI and most about me. The model has no scope discipline. So I have to bring my own and notice when I’m the one breaking it.&lt;/p&gt;

&lt;h2 id=&quot;the-rubber-duck-that-talks-back&quot;&gt;The Rubber Duck That Talks Back&lt;/h2&gt;

&lt;p&gt;Rubber duck debugging is a real technique. You explain your code to an inanimate object and in the process of explaining, you find the bug yourself. The object doesn’t help. The explanation does.&lt;/p&gt;

&lt;p&gt;AI pair programming is rubber duck debugging if the rubber duck argued with you, gave you bad advice confidently, and you had to diplomatically respond “that’s an interesting perspective, but I’m going to go with my original approach.”&lt;/p&gt;

&lt;p&gt;That sounds bad. It actually isn’t. The argument &lt;em&gt;is&lt;/em&gt; the value. Forcing me to say “no, we’re not using Redux for this, here’s why” surfaces my actual reasoning in a way that staring at the screen doesn’t. The AI isn’t right. I’m not even trying to convince it. But explaining why it’s wrong is doing the same thing the rubber duck does, with more friction and more upside.&lt;/p&gt;

&lt;p&gt;And sometimes the AI is right. Helpfully right in a way that saves you an hour. That’s the slot machine moment people warn about. &lt;a href=&quot;https://www.stephanmiller.com/the-great-vibe-coding-experiment/&quot;&gt;Just one more prompt&lt;/a&gt;. It’s &lt;em&gt;almost&lt;/em&gt; there. The &lt;a href=&quot;https://www.programming-helper.com/tech/developers-voice-frustrations-over-ai-coding-assistant-output-quality&quot;&gt;developer frustration data from programming-helper.com&lt;/a&gt; shows the pattern: hallucinations and “almost correct” output keep developers engaged because the occasional win pays for the losses.&lt;/p&gt;

&lt;p&gt;The way out of the slot machine isn’t quitting the casino. It’s noticing the loop and breaking it on purpose. After two iterations that don’t land, I stop. Either I write the thing myself, or I &lt;a href=&quot;https://www.stephanmiller.com/i-burned-out-on-vibe-coding-came-back-and-rewrote-everything/&quot;&gt;step back and figure out what algorithm or scaffold the AI was missing&lt;/a&gt;. Pulling the lever a third time, hoping this prompt is the one: that’s where the day disappears.&lt;/p&gt;

&lt;h2 id=&quot;the-prognosis-am-i-better-off&quot;&gt;The Prognosis: Am I Better Off?&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/senior-software-engineer-by-title-ai-therapist-by-reality-body-5.jpg&quot; alt=&quot;The Prognosis: Am I Better Off?&quot; srcset=&quot;            /assets/resized/480/senior-software-engineer-by-title-ai-therapist-by-reality-body-5.jpg 480w,            /assets/resized/800/senior-software-engineer-by-title-ai-therapist-by-reality-body-5.jpg 800w,            /assets/resized/1400/senior-software-engineer-by-title-ai-therapist-by-reality-body-5.jpg 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Yes. Honestly, yes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it works&lt;/strong&gt;: starting projects from scratch, generating documentation and boilerplate, the “army of interns with PhDs” feeling. When I’m building a new service and need structure, tests, config, and scaffolding: AI tools are genuinely useful. I build faster. I cover edge cases I’d have missed while moving quickly. The ceiling on what one developer can ship in a sprint has gone up in measurable ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it doesn’t&lt;/strong&gt;: legacy code with context that doesn’t fit in a context window. Anything where “almost correct” compounds across multiple sessions. When the AI “improves” working code because it can see a better pattern. Any problem where the truth lives in production state rather than the codebase. And anything where I haven’t done the thinking work up front about what I actually want and how to frame it.&lt;/p&gt;

&lt;p&gt;That last category is the one I have control over, which is why this article isn’t a complaint. The first time I had to sit and &lt;em&gt;plan a prompt&lt;/em&gt; like I’d plan a meeting, I felt ridiculous. Now it’s just the job. Think about what I want. Think about what the AI is likely to do with each phrasing. Think about what scope I’m authorizing. Think about whether this is a one-off or whether I should encode it as a skill so I never have to think about it again.&lt;/p&gt;

&lt;p&gt;Any developer can learn the tools in an afternoon. The actual skill is the thinking that happens &lt;em&gt;before&lt;/em&gt; you type the first character: anticipating how the model will react to your framing, your scope, your context. That’s not a technical skill. It’s a human skill applied to machines. And it’s the part nobody handed me a manual for.&lt;/p&gt;

&lt;p&gt;That’s what I mean by therapy. Not “the AI is broken and needs my emotional support.” More like: the relationship has its own dynamics, and learning to work inside those dynamics on purpose is the difference between a week of cleanup and a week of shipping.&lt;/p&gt;

&lt;h2 id=&quot;the-therapist-is-in&quot;&gt;The Therapist Is In&lt;/h2&gt;

&lt;p&gt;Twenty years ago, I worked with a coding language, a laptop, and a manual. Today I sit down with seven different AI assistants and a mental playbook for each one.&lt;/p&gt;

&lt;p&gt;Some days I miss the simplicity. Most days I don’t. I ship more, I cover more ground, and the failure modes are at least new failure modes instead of the same legacy spaghetti I’ve been untangling for two decades.&lt;/p&gt;

&lt;p&gt;The lesson isn’t “AI is hard.” It’s “I had to stop typing what I wanted and start thinking about how to say it.” Once that clicked, the slot machine quieted down. The intern stopped freelancing. The golden retriever stopped lighting sticks on fire. The know-it-all gave me one-line answers.&lt;/p&gt;

&lt;p&gt;If you’re reading this and thinking “yeah, that’s my Tuesday,” welcome to the profession. We’re all AI psychologists now. The good news is, you can get good at it. The job is mostly thinking before you prompt, and that’s a skill, not a personality trait.&lt;/p&gt;
</description>
        <pubDate>Mon, 11 May 2026 07:00:00 -0500</pubDate>
        <link>https://www.stephanmiller.com/senior-software-engineer-by-title-ai-therapist-by-reality/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/senior-software-engineer-by-title-ai-therapist-by-reality/</guid>
        
        
        <category>agentic-development</category>
        
      </item>
    
      <item>
        <title>The Cheapskate&apos;s Guide to the Arena Leaderboard: Why I Stopped Paying Claude Opus Prices</title>
        <description>&lt;p&gt;I kept noticing this thing while writing the model roundup every week. The “best models” lists all lead with $25-per-million Claude Opus, and then I’d open the Arena leaderboard for creative writing and notice Gemini 3 Flash sitting above Claude Sonnet for one-tenth the price. Or open the coding leaderboard and find GLM 5.1 tying Claude Opus 4.6 inside the top ten while costing seven times less.&lt;/p&gt;

&lt;p&gt;So I’d do the math. Every week. By hand. While writing about something else.&lt;/p&gt;

&lt;p&gt;This week I made the math the centerpiece. Welcome to the Cheapskate Picks, the cheapest model within striking distance of the leader for every Arena category that matters. This blog post that started because I kept doing this myself now does it for you.&lt;/p&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#the-compression-problem-or-why-youre-probably-overpaying&quot; id=&quot;markdown-toc-the-compression-problem-or-why-youre-probably-overpaying&quot;&gt;The Compression Problem (Or: Why You’re Probably Overpaying)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-cheapskate-picks-may-18-2026&quot; id=&quot;markdown-toc-the-cheapskate-picks-may-18-2026&quot;&gt;The Cheapskate Picks (May 1–8, 2026)&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#overall-gemini-3-flash-050300&quot; id=&quot;markdown-toc-overall-gemini-3-flash-050300&quot;&gt;Overall: Gemini 3 Flash, $0.50/$3.00&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#coding-glm-51-the-swe-bench-pro-killer&quot; id=&quot;markdown-toc-coding-glm-51-the-swe-bench-pro-killer&quot;&gt;Coding: GLM 5.1, the SWE-Bench Pro Killer&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#creative-writing-gemini-3-flash&quot; id=&quot;markdown-toc-creative-writing-gemini-3-flash&quot;&gt;Creative Writing: Gemini 3 Flash&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#math-deepseek-v4-pro-thinking-the-17x-discount&quot; id=&quot;markdown-toc-math-deepseek-v4-pro-thinking-the-17x-discount&quot;&gt;Math: DeepSeek V4 Pro Thinking, the 17x Discount&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#instruction-following-mimo-v25-pro&quot; id=&quot;markdown-toc-instruction-following-mimo-v25-pro&quot;&gt;Instruction Following: MiMo V2.5 Pro&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#hard-prompts-gemini-3-flash-again&quot; id=&quot;markdown-toc-hard-prompts-gemini-3-flash-again&quot;&gt;Hard Prompts: Gemini 3 Flash, Again&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#multi-turn-gemini-3-flash-again-again&quot; id=&quot;markdown-toc-multi-turn-gemini-3-flash-again-again&quot;&gt;Multi-Turn: Gemini 3 Flash, Again Again&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#the-quick-reference-table&quot; id=&quot;markdown-toc-the-quick-reference-table&quot;&gt;The Quick-Reference Table&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#glm-51-the-sota-nobodys-pricing-in&quot; id=&quot;markdown-toc-glm-51-the-sota-nobodys-pricing-in&quot;&gt;GLM 5.1: The SOTA Nobody’s Pricing In&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#tencents-hy3-free-cliff-hits&quot; id=&quot;markdown-toc-tencents-hy3-free-cliff-hits&quot;&gt;Tencent’s Hy3 Free Cliff Hits&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-asterisks-or-cheap-is-fine-if-you-know-what-youre-losing&quot; id=&quot;markdown-toc-the-asterisks-or-cheap-is-fine-if-you-know-what-youre-losing&quot;&gt;The Asterisks (Or: Cheap Is Fine If You Know What You’re Losing)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#coming-up-google-io-may-19-the-gemini-4-question&quot; id=&quot;markdown-toc-coming-up-google-io-may-19-the-gemini-4-question&quot;&gt;Coming Up: Google I/O May 19, the Gemini 4 Question&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-receipts&quot; id=&quot;markdown-toc-the-receipts&quot;&gt;The Receipts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-compression-problem-or-why-youre-probably-overpaying&quot;&gt;The Compression Problem (Or: Why You’re Probably Overpaying)&lt;/h2&gt;

&lt;p&gt;Here is the structural fact that powers everything else in this post: the Arena leaderboard’s Overall top 20 spans &lt;strong&gt;35 rating points&lt;/strong&gt;. From #1 (claude-opus-4-7-thinking at 1503) down to #20 (claude-opus-4-5 at 1468). That’s it. The entire visible top end of the leaderboard fits in less than 3% of the rating scale.&lt;/p&gt;

&lt;p&gt;Meanwhile the prices fan out 30x. Claude Opus 4.7 costs $25/M output. Gemini 3 Flash, which sits at #16 in that same Overall top 20 with a rating of 1474, costs $3/M output. Twenty-nine rating points apart, about 2% on the scale, eight times the price.&lt;/p&gt;

&lt;p&gt;That is the cheapskate problem stated as a math equation. Nobody is going to feel a 2% rating gap. They will absolutely feel an 8x cost difference when the bill arrives.&lt;/p&gt;

&lt;p&gt;So here is the heuristic I’m using from now on:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Anchor on the category leader’s Arena rating&lt;/li&gt;
  &lt;li&gt;Define a competitive band: default 50 rating points below the leader&lt;/li&gt;
  &lt;li&gt;Sort models in the band by output price&lt;/li&gt;
  &lt;li&gt;Cheapest in the band is the cheapskate pick. Report rating delta and price ratio so you can judge the trade&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The reason this beats “best models under $1” thresholds is that different categories have different price floors. Vision is more expensive than text. Math has its own dynamics. A fixed dollar threshold breaks every category that doesn’t match it. The score-gap-vs-price-gap framing adapts on its own.&lt;/p&gt;

&lt;p&gt;I am not saying that Claude Opus 4.7 is bad. It’s the leader on Arena Overall and Coding and Multi-Turn. But the gap you’re paying $22/M extra for might not be there. And in some categorie, coding most loudly, there’s a model in the band that &lt;em&gt;outperforms the leader&lt;/em&gt; on the benchmark that actually maps to your job.&lt;/p&gt;

&lt;p&gt;Speaking of which.&lt;/p&gt;

&lt;h2 id=&quot;the-cheapskate-picks-may-18-2026&quot;&gt;The Cheapskate Picks (May 1–8, 2026)&lt;/h2&gt;

&lt;p&gt;Methodology in plain English: cheapest model within 50 rating points of the category leader. Band used everywhere this week, because the data was unusually compressed across the board.&lt;/p&gt;

&lt;h3 id=&quot;overall-gemini-3-flash-050300&quot;&gt;Overall: Gemini 3 Flash, $0.50/$3.00&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Leader&lt;/strong&gt;: claude-opus-4-7-thinking — rating 1503 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$25/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cheapskate pick&lt;/strong&gt;: Gemini 3 Flash Preview — rating 1474 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$3/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Δ rating: −29 points. Price ratio: 8.3x cheaper.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenRouter slug: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;google/gemini-3-flash-preview&lt;/code&gt;. Multimodal. 1M context. The boring correct answer of mid-2026.&lt;/p&gt;

&lt;p&gt;If you have one model running for general daily-driver work and you are paying $25/M for output, you are subsidizing margin. Twenty-nine rating points on a 1500-point scale is below the threshold any human would notice in an A/B test, much less a production workflow.&lt;/p&gt;

&lt;h3 id=&quot;coding-glm-51-the-swe-bench-pro-killer&quot;&gt;Coding: GLM 5.1, the SWE-Bench Pro Killer&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Leader&lt;/strong&gt;: claude-opus-4-7-thinking — rating 1569 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$25/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cheapskate pick&lt;/strong&gt;: GLM 5.1 (Z.ai) — rating 1525 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$3.50/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Δ rating: −44 points. Price ratio: 7.1x cheaper.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenRouter slug: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;z-ai/glm-5.1&lt;/code&gt;. MIT-licensed. Weights on Hugging Face.&lt;/p&gt;

&lt;p&gt;Here is where the cheapskate framing stops being polite. GLM 5.1 &lt;strong&gt;beats Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on SWE-Bench Pro&lt;/strong&gt; with a score of 58.4. SWE-Bench Pro is the benchmark where the model has to actually fix real GitHub issues in real codebases. The thing the leader is supposed to be the leader at.&lt;/p&gt;

&lt;p&gt;So the situation is: on Arena’s vibes-based head-to-head vote (people picking which output looks nicer), Opus 4.7-thinking wins. On the benchmark that maps to the job you are actually paying these models to do, an open-weight Chinese model from a lab most readers haven’t heard of wins. And it is seven times cheaper.&lt;/p&gt;

&lt;p&gt;Honorable mention: Kimi K2.6 (Moonshot) at rating 1519 / $3.50: same price tier, similar profile, also open-weight. If you don’t like Z.ai’s politics or licensing, Moonshot is the same trade.&lt;/p&gt;

&lt;h3 id=&quot;creative-writing-gemini-3-flash&quot;&gt;Creative Writing: Gemini 3 Flash&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Leader&lt;/strong&gt;: claude-opus-4-6-thinking — rating 1494 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$25/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cheapskate pick&lt;/strong&gt;: Gemini 3 Flash Preview — rating 1459 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$3/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Δ rating: −35 points. Price ratio: 8.3x cheaper.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the category that triggered the methodology. Gemini 3 Flash sits at rating 1459 in creative writing. Claude Sonnet 4.5 sits at 1451. The cheap Google Flash model &lt;strong&gt;outranks the mid-tier Anthropic model&lt;/strong&gt; for prose generation, while costing five times less than Sonnet and twenty-eight times less than the actual category leader.&lt;/p&gt;

&lt;p&gt;If you’re writing fiction or marketing copy or anything generative-prose-shaped and paying Sonnet pricing, you are losing on both ends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daredevil pick&lt;/strong&gt;: DeepSeek V4 Pro at rating 1449 / &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$0.87/M output&lt;/code&gt; — that’s 28.7x cheaper than the leader, and it sits at the band edge with −45 rating points. You give up another 10 rating points (still a sub-1% gap on the scale) and save another 3.4x on top of Gemini 3 Flash. For batch creative work where you don’t care about multimodal input, V4 Pro is the cheapest defensible answer.&lt;/p&gt;

&lt;h3 id=&quot;math-deepseek-v4-pro-thinking-the-17x-discount&quot;&gt;Math: DeepSeek V4 Pro Thinking, the 17x Discount&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Leader&lt;/strong&gt;: gpt-5.4-high — rating 1515 — about &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$15/M output&lt;/code&gt; (gpt-5.4 base; high-reasoning costs the same per token, you just burn more of them)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cheapskate pick&lt;/strong&gt;: DeepSeek V4 Pro (thinking mode) — rating 1479 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$0.87/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Δ rating: −36 points. Price ratio: ~17x cheaper.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenRouter slug: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deepseek/deepseek-v4-pro&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reasoning: { effort: &quot;high&quot; }&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xhigh&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you do math with an LLM and you are paying OpenAI prices, stop. DeepSeek V4 Pro with thinking enabled is 36 rating points behind on Arena math, which is roughly 2.4% of the scale, for one-seventeenth the cost. The math category was the one where the price gap most embarrassed the leader.&lt;/p&gt;

&lt;p&gt;Conservative runner-up: Gemini 3 Flash at rating 1476 / $3/M output. Five times cheaper than the leader, more conservative than V4 Pro Thinking, multimodal if you need to feed it diagrams.&lt;/p&gt;

&lt;h3 id=&quot;instruction-following-mimo-v25-pro&quot;&gt;Instruction Following: MiMo V2.5 Pro&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Leader&lt;/strong&gt;: claude-opus-4-6-thinking — rating 1518 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$25/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cheapskate pick&lt;/strong&gt;: MiMo V2.5 Pro (Xiaomi) — rating 1468 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$3/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Δ rating: −50 points. Price ratio: 8.3x cheaper.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenRouter slug: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xiaomi/mimo-v2.5-pro&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Yes… the phone company. Their LLM team has been quietly competitive for two product cycles now and MiMo V2.5 Pro lands right at the band edge for instruction following at one-eighth the price. If “deploying a Xiaomi model in production” makes the security team start asking questions, the honorable mention is Claude Sonnet 4.6 at rating 1476 / $15/M output: only 1.7x cheaper than the leader, but you keep your name brand.&lt;/p&gt;

&lt;p&gt;This is the category where the band was tightest: only the top 12 models fit in the 50-point window, which means MiMo squeaked in at the edge. That’s a structural note: in the categories where the top is more spread out, the cheapskate pick has more cushion. Instruction Following had the smallest cushion this week.&lt;/p&gt;

&lt;h3 id=&quot;hard-prompts-gemini-3-flash-again&quot;&gt;Hard Prompts: Gemini 3 Flash, Again&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Leader&lt;/strong&gt;: claude-opus-4-6-thinking — rating 1535 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$25/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cheapskate pick&lt;/strong&gt;: Gemini 3 Flash Preview — rating 1493 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$3/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Δ rating: −42 points. Price ratio: 8.3x cheaper.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same story as Overall and Creative Writing. The Hard Prompts leader has the highest absolute rating of any category (1535), but Gemini 3 Flash still sits comfortably in the band 42 points back. MiMo V2.5 Pro is essentially tied at rating 1492 / $3: pick by ecosystem preference.&lt;/p&gt;

&lt;h3 id=&quot;multi-turn-gemini-3-flash-again-again&quot;&gt;Multi-Turn: Gemini 3 Flash, Again Again&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Leader&lt;/strong&gt;: claude-opus-4-7-thinking — rating 1529 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$25/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cheapskate pick&lt;/strong&gt;: Gemini 3 Flash Preview — rating 1484 — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$3/M output&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Δ rating: −45 points. Price ratio: 8.3x cheaper.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The conservative pick here is Claude Sonnet 4.6 at rating 1482 / $15/M output. If you specifically want Anthropic’s multi-turn glue (the way Claude tracks state across long conversations), Sonnet is the cheapest Anthropic option in the band. But Gemini 3 Flash is two rating points higher for one-fifth the price, so unless you have a brand-loyalty reason, the math says Flash.&lt;/p&gt;

&lt;h3 id=&quot;the-quick-reference-table&quot;&gt;The Quick-Reference Table&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Category&lt;/th&gt;
      &lt;th&gt;Leader&lt;/th&gt;
      &lt;th&gt;$ leader (out/M)&lt;/th&gt;
      &lt;th&gt;Cheapskate pick&lt;/th&gt;
      &lt;th&gt;$ pick (out/M)&lt;/th&gt;
      &lt;th&gt;Δ rating&lt;/th&gt;
      &lt;th&gt;Price ratio&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Overall&lt;/td&gt;
      &lt;td&gt;claude-opus-4-7-thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;Gemini 3 Flash&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;−29&lt;/td&gt;
      &lt;td&gt;8.3x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Coding&lt;/td&gt;
      &lt;td&gt;claude-opus-4-7-thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;GLM 5.1&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;$3.50&lt;/td&gt;
      &lt;td&gt;−44&lt;/td&gt;
      &lt;td&gt;7.1x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Creative Writing&lt;/td&gt;
      &lt;td&gt;claude-opus-4-6-thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;Gemini 3 Flash&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;−35&lt;/td&gt;
      &lt;td&gt;8.3x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Math&lt;/td&gt;
      &lt;td&gt;gpt-5.4-high&lt;/td&gt;
      &lt;td&gt;~$15&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;DeepSeek V4 Pro (thinking)&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;$0.87&lt;/td&gt;
      &lt;td&gt;−36&lt;/td&gt;
      &lt;td&gt;~17x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Instruction Following&lt;/td&gt;
      &lt;td&gt;claude-opus-4-6-thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;MiMo V2.5 Pro&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;−50&lt;/td&gt;
      &lt;td&gt;8.3x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Hard Prompts&lt;/td&gt;
      &lt;td&gt;claude-opus-4-6-thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;Gemini 3 Flash&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;−42&lt;/td&gt;
      &lt;td&gt;8.3x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Multi-Turn&lt;/td&gt;
      &lt;td&gt;claude-opus-4-7-thinking&lt;/td&gt;
      &lt;td&gt;$25&lt;/td&gt;
      &lt;td&gt;Gemini 3 Flash&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;−45&lt;/td&gt;
      &lt;td&gt;8.3x&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The pattern: &lt;strong&gt;Gemini 3 Flash wins the cheapskate slot in 4 of 7 Arena categories at $0.50 input / $3 output&lt;/strong&gt; (Overall, Creative Writing, Hard Prompts, Multi-Turn). It’s the boring correct answer. The interesting picks are where it doesn’t win:Coding (GLM 5.1 because it actually beats the leader on SWE-Bench Pro), Math (DeepSeek V4 Pro Thinking because the price gap is absurd), and Instruction Following (MiMo V2.5 Pro, on a band edge, from Xiaomi).&lt;/p&gt;

&lt;p&gt;And &lt;strong&gt;none of the seven categories needed a “you’re paying for quality here” caveat.&lt;/strong&gt; Every category had a sub-$3.50/M output option in the band. As of last week, you can pay under $3.50/M output and stay within 50 rating points of the category leader on every major Arena category.&lt;/p&gt;

&lt;h2 id=&quot;glm-51-the-sota-nobodys-pricing-in&quot;&gt;GLM 5.1: The SOTA Nobody’s Pricing In&lt;/h2&gt;

&lt;p&gt;Z.ai released GLM 5.1 on April 7, 2026. Mixture-of-experts, 744B total parameters, 40B active per token. MIT license. Weights on &lt;a href=&quot;https://huggingface.co/zai-org/GLM-5.1&quot;&gt;Hugging Face&lt;/a&gt;. The reviews you can find on it are all the same shape: “wait, this thing is &lt;em&gt;what&lt;/em&gt; on coding?”&lt;/p&gt;

&lt;p&gt;The numbers from the &lt;a href=&quot;https://renovateqr.com/blog/glm-5-1-review-z-ai-coding-benchmark-2026&quot;&gt;Renovate QR review&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;SWE-Bench Pro: 58.4&lt;/strong&gt; — beats Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CyberGym: 68.7&lt;/strong&gt; — about 20 points above GLM-5&lt;/li&gt;
  &lt;li&gt;8-hour autonomous coding runs with ~1,700 reasoning steps&lt;/li&gt;
  &lt;li&gt;API pricing on OpenRouter: $1.05 input / $3.50 output — 6 to 10x cheaper than Opus 4.6&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic dominates the Arena leaderboard. Eleven of the top 20 in Instruction Following are Claude variants. Seven of the top 20 in Multi-Turn. The brand wins the popularity contest. But on a benchmark that has to map to “did the model actually fix the bug,” an open-weight model from a Chinese lab is the new state of the art, and it’s almost an order of magnitude cheaper.&lt;/p&gt;

&lt;p&gt;This is the under-sold value pick the cheapskate framing rewards. It’s not in the noise of “every new model claims a benchmark win.” It’s tied with the most expensive frontier model on the benchmark closest to the actual job, and the community hasn’t priced in what this means yet.&lt;/p&gt;

&lt;h2 id=&quot;tencents-hy3-free-cliff-hits&quot;&gt;Tencent’s Hy3 Free Cliff Hits&lt;/h2&gt;

&lt;p&gt;Last week’s lead story was Tencent’s Hy3 Preview running away with #1 on OpenRouter at +1,356% week-over-week. The catch was that the entire spike was driven by Tencent giving the model away free until May 8 to seed adoption.&lt;/p&gt;

&lt;p&gt;If you built a workflow on Hy3’s free tier, you hit the paywall. Migration window: zero. Some of you might have woke up with a billing surprise.&lt;/p&gt;

&lt;p&gt;What I’ll be watching next week is the size of the cliff. If Hy3 holds top-five even at paid pricing, the free run was a successful seeding strategy. If it craters out of the top ten the moment the meter starts running, the entire spike was a free-period mirage and the model’s real value was lower all along.&lt;/p&gt;

&lt;p&gt;For what to use &lt;em&gt;instead&lt;/em&gt; if you got caught flat-footed: Hy3’s nearest like-for-like by price after the cliff is DeepSeek V4 Flash at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$0.14/$0.28&lt;/code&gt;, which is actually slightly cheaper. And V4 Flash has the &lt;a href=&quot;https://ghost.codersera.com/blog/deepseek-v4-flash-ai-agents-cheap-fast-tier-guide/&quot;&gt;agent-default chorus&lt;/a&gt; behind it that Hy3 never built. Migration target if you need one: V4 Flash.&lt;/p&gt;

&lt;h2 id=&quot;the-asterisks-or-cheap-is-fine-if-you-know-what-youre-losing&quot;&gt;The Asterisks (Or: Cheap Is Fine If You Know What You’re Losing)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gemini 3 Flash MRCR retrieval cliff.&lt;/strong&gt; This is the one that bit me earlier this year. The &lt;a href=&quot;https://cybernews.com/ai-tools/gemini-3-flash-review/&quot;&gt;Cybernews review&lt;/a&gt; confirms it numerically: MRCR retrieval drops from 60.1% accuracy at 128K context to 12.3% at 1M. If you’re running RAG-heavy workflows and pumping the full million-token context window full of documents, the cheapskate pick falls off a cliff at long context. Cap your context at 128K for retrieval-shaped work, or accept the hallucinations. Don’t say I didn’t warn you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek V4 Flash factual recall hole.&lt;/strong&gt; &lt;a href=&quot;https://artificialanalysis.ai/models/deepseek-v4-flash&quot;&gt;Artificial Analysis&lt;/a&gt; shows V4 Flash scoring 34.1% on SimpleQA versus V4 Pro’s 57.9%. The 25x output savings come with a “won’t reliably know facts” asterisk. V4 Flash is great for agent loops where you’re feeding it grounded context anyway. It’s bad as a free-recall question-answerer. Pair it with retrieval. Don’t ask it to remember.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hy3 “you built on a free tier” thing.&lt;/strong&gt; Predictable, still happening to people today. If you have an LLM in a critical workflow and the only reason you picked it was “free,” that workflow’s billing model is broken by design. The fix is to pick a model where the paid pricing is still cheap enough to justify the workflow.&lt;/p&gt;

&lt;p&gt;These are not reasons to not use the cheapskate picks. These are reasons to know what you’re picking. The model card for “I will hallucinate factual recall, but I cost a quarter” is fine if the workflow doesn’t depend on factual recall. It’s catastrophic if it does.&lt;/p&gt;

&lt;h2 id=&quot;coming-up-google-io-may-19-the-gemini-4-question&quot;&gt;Coming Up: Google I/O May 19, the Gemini 4 Question&lt;/h2&gt;

&lt;p&gt;Google I/O 2026 is &lt;a href=&quot;https://www.androidauthority.com/what-to-expect-from-google-io-2026-3664979/&quot;&gt;May 19–20 at Shoreline Amphitheatre&lt;/a&gt;. The big rumored announcement is Gemini 4 with a claimed 84.6% on ARC-AGI2, integrated image and video generation, and a &lt;a href=&quot;https://wavespeed.ai/blog/posts/google-omni-video-model-leak-i-o-2026/&quot;&gt;new “Omni” video model&lt;/a&gt; replacing the internal Toucan tool. Rumors also include “Remy,” a 24/7 always-on agent, and a Proactive Assistant that pushes suggestions instead of waiting for prompts.&lt;/p&gt;

&lt;p&gt;The reason this matters for the cheapskate analysis is that Google is &lt;em&gt;already&lt;/em&gt; winning the cheapskate slot at the Flash tier. Gemini 3 Flash is the boring correct answer for four of seven categories at $3/M output. If Gemini 4 Pro lands at SOTA on the leader benchmarks, the gap from the top of the leaderboard closes downward. The cheapskate band stays the same; the leader’s value proposition gets squeezed harder.&lt;/p&gt;

&lt;p&gt;If Gemini 4 doesn’t land well, the leaderboard stays compressed in roughly its current shape and the cheapskate pattern holds. Either way I’ll be writing about it. Either way, my OpenRouter bill is not going up.&lt;/p&gt;

&lt;p&gt;The OpenRouter stealth slot is still occupied by Owl Alpha (April 28, free, 1.05M context) per the W18 issue. No fresh signal this week. Claude Mythos is still research-only with no public release update. GPT-6 “Spud” is still rumored for late 2026 with no fresh leaks.&lt;/p&gt;

&lt;p&gt;For the full W18 context including the original Hy3 spike and the $300/month Grok 4.3 amnesiac story, see &lt;a href=&quot;https://www.stephanmiller.com/model-roundup-w18-the-free-countdown-the-300-amnesiac-and-the-quiet-climber-at-7/&quot;&gt;last week’s roundup&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;the-receipts&quot;&gt;The Receipts&lt;/h2&gt;

&lt;p&gt;The leaderboard is compressed. The prices aren’t. That’s the whole post.&lt;/p&gt;

&lt;p&gt;Concrete numbers from the last week: the entire Arena Overall top 20 fits in 35 rating points. Six of seven Arena categories have a cheapskate pick at $3.50 per million output tokens or less. Three categories have a cheapskate pick that’s eight times cheaper than the leader for under 3% of the rating scale. One category, coding, has a cheapskate pick (GLM 5.1) that’s the new state of the art on SWE-Bench Pro, beating Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro at seven times less the cost.&lt;/p&gt;

&lt;p&gt;Anthropic charges 8x more for under 3% better. Here are the receipts.&lt;/p&gt;

&lt;p&gt;The Cheapskate Picks methodology lives in this weekly blog post from now on. Next week we see what happens to OpenRouter rankings when Hy3’s rocket booster falls off. The week after, we see whether Google I/O makes any of this obsolete. Either way, I am not paying $25 per million output tokens for a 2% rating bump. Neither should you.&lt;/p&gt;
</description>
        <pubDate>Sat, 09 May 2026 07:00:00 -0500</pubDate>
        <link>https://www.stephanmiller.com/the-cheapskates-guide-to-the-arena-leaderboard-why-i-stopped-paying-claude-opus-prices/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/the-cheapskates-guide-to-the-arena-leaderboard-why-i-stopped-paying-claude-opus-prices/</guid>
        
        
        <category>large-language-models</category>
        
      </item>
    
      <item>
        <title>The Autoresearch Ecosystem - How One Repo Spawned 9 Different Types of AI Projects</title>
        <description>&lt;p&gt;I’d been messing around with &lt;a href=&quot;https://github.com/karpathy/autoresearch&quot;&gt;Karpathy’s autoresearch&lt;/a&gt; for a couple of weekends, mostly because I’m interested in letting agents do shit while I sleep and someone had finally formalized the pattern in 630 lines of Python. Run the loop, modify &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;train.py&lt;/code&gt;, train for five minutes, check val_bpb, keep or revert, repeat forever. Compounding gains while you’re not even at your desk.&lt;/p&gt;

&lt;p&gt;So I fired up GitHub search for “autoresearch” expecting to find a handful of ML forks. People porting it to their hardware, maybe a few hyperparameter tweaks. You know how that goes.&lt;/p&gt;

&lt;p&gt;I found nine distinct categories of project. Some brilliant. Some “why did you do this.” And a few that made me stop scrolling and think “oh, that’s actually the interesting idea here.” It turns out the original repo isn’t really about ML. It’s a pattern, and people figured that out pretty quickly.&lt;/p&gt;

&lt;p&gt;I’m going to walk through every category I found, what each one actually does differently, and what they tell us about where this whole thing is going. There are a lot of repos here, all linked.&lt;/p&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#what-karpathy-actually-built&quot; id=&quot;markdown-toc-what-karpathy-actually-built&quot;&gt;What Karpathy Actually Built&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1-platform-ports-running-it-on-hardware-you-actually-own&quot; id=&quot;markdown-toc-1-platform-ports-running-it-on-hardware-you-actually-own&quot;&gt;1. Platform Ports: Running It On Hardware You Actually Own&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#gpu-cluster-scaling&quot; id=&quot;markdown-toc-gpu-cluster-scaling&quot;&gt;GPU Cluster Scaling&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#2-ml-research-enhancers-making-the-loop-smarter&quot; id=&quot;markdown-toc-2-ml-research-enhancers-making-the-loop-smarter&quot;&gt;2. ML Research Enhancers: Making the Loop Smarter&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#memory-enhanced-researchers&quot; id=&quot;markdown-toc-memory-enhanced-researchers&quot;&gt;Memory-Enhanced Researchers&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#bayesian--active-inference&quot; id=&quot;markdown-toc-bayesian--active-inference&quot;&gt;Bayesian + Active Inference&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#multi-gpu-infrastructure&quot; id=&quot;markdown-toc-multi-gpu-infrastructure&quot;&gt;Multi-GPU Infrastructure&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#3-prompt-optimizers-same-loop-different-target-file&quot; id=&quot;markdown-toc-3-prompt-optimizers-same-loop-different-target-file&quot;&gt;3. Prompt Optimizers: Same Loop, Different Target File&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#autoresearch-prompt-optimization-az9713&quot; id=&quot;markdown-toc-autoresearch-prompt-optimization-az9713&quot;&gt;autoresearch-prompt-optimization (az9713)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#autoresearch-for-agents-galileo&quot; id=&quot;markdown-toc-autoresearch-for-agents-galileo&quot;&gt;autoresearch-for-agents (Galileo)&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#4-generalized-frameworks-autoresearch-for-anything&quot; id=&quot;markdown-toc-4-generalized-frameworks-autoresearch-for-anything&quot;&gt;4. Generalized Frameworks: Autoresearch For Anything&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#uditgoenkaautoresearch--claude-code-skill&quot; id=&quot;markdown-toc-uditgoenkaautoresearch--claude-code-skill&quot;&gt;uditgoenka/autoresearch — Claude Code Skill&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#autoresearch-anything-zkarimi22&quot; id=&quot;markdown-toc-autoresearch-anything-zkarimi22&quot;&gt;autoresearch-anything (zkarimi22)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#menonpgautoloop--the-pip-package&quot; id=&quot;markdown-toc-menonpgautoloop--the-pip-package&quot;&gt;menonpg/autoloop — The pip Package&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#krzysztofdudekresearcherskill--one-file-full-discipline&quot; id=&quot;markdown-toc-krzysztofdudekresearcherskill--one-file-full-discipline&quot;&gt;krzysztofdudek/ResearcherSkill — One File, Full Discipline&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#alfonsograzianoauto-agent--autoresearch-builds-agents&quot; id=&quot;markdown-toc-alfonsograzianoauto-agent--autoresearch-builds-agents&quot;&gt;alfonsograziano/auto-agent — Autoresearch Builds Agents&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#5-production-codebase-optimization-autoresearch-on-real-oss&quot; id=&quot;markdown-toc-5-production-codebase-optimization-autoresearch-on-real-oss&quot;&gt;5. Production Codebase Optimization: Autoresearch on Real OSS&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#more-production-war-stories&quot; id=&quot;markdown-toc-more-production-war-stories&quot;&gt;More Production War Stories&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#idealo-search-ranking&quot; id=&quot;markdown-toc-idealo-search-ranking&quot;&gt;idealo Search Ranking&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#tennis-xgboost--the-reward-hacking-cautionary-tale&quot; id=&quot;markdown-toc-tennis-xgboost--the-reward-hacking-cautionary-tale&quot;&gt;Tennis XGBoost — The Reward Hacking Cautionary Tale&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#vesuvius-challenge-ink-detection&quot; id=&quot;markdown-toc-vesuvius-challenge-ink-detection&quot;&gt;Vesuvius Challenge Ink Detection&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#6-agent-factory-autoresearch-builds-agents&quot; id=&quot;markdown-toc-6-agent-factory-autoresearch-builds-agents&quot;&gt;6. Agent Factory: Autoresearch Builds Agents&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#7-research-os--skills-systems-institutionalizing-the-pattern&quot; id=&quot;markdown-toc-7-research-os--skills-systems-institutionalizing-the-pattern&quot;&gt;7. Research OS / Skills Systems: Institutionalizing the Pattern&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#phd-zero-tenureai&quot; id=&quot;markdown-toc-phd-zero-tenureai&quot;&gt;PhD-Zero (TenureAI)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#alirezarezvaniclaude-skills&quot; id=&quot;markdown-toc-alirezarezvaniclaude-skills&quot;&gt;alirezarezvani/claude-skills&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#8-creative-writing-autoresearch-for-prose-and-fiction&quot; id=&quot;markdown-toc-8-creative-writing-autoresearch-for-prose-and-fiction&quot;&gt;8. Creative Writing: Autoresearch For Prose and Fiction&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#redpen--prose-refinement-engine&quot; id=&quot;markdown-toc-redpen--prose-refinement-engine&quot;&gt;redpen — Prose Refinement Engine&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#nousresearchautonovel--complete-novel-pipeline&quot; id=&quot;markdown-toc-nousresearchautonovel--complete-novel-pipeline&quot;&gt;NousResearch/autonovel — Complete Novel Pipeline&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#sinfinyauto-creative-reasoning&quot; id=&quot;markdown-toc-sinfinyauto-creative-reasoning&quot;&gt;sinfiny/Auto-Creative-Reasoning&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#calvinmageziself-evolving-skill--brand-document-evolution&quot; id=&quot;markdown-toc-calvinmageziself-evolving-skill--brand-document-evolution&quot;&gt;CalvinMagezi/self-evolving-skill — Brand Document Evolution&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#9-meta-pattern-wrapping-autoresearch-as-a-worker&quot; id=&quot;markdown-toc-9-meta-pattern-wrapping-autoresearch-as-a-worker&quot;&gt;9. Meta-Pattern: Wrapping Autoresearch as a Worker&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#the-problem-with-solo-autoresearch&quot; id=&quot;markdown-toc-the-problem-with-solo-autoresearch&quot;&gt;The Problem with Solo Autoresearch&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#the-fix-3-files-4-subagents&quot; id=&quot;markdown-toc-the-fix-3-files-4-subagents&quot;&gt;The Fix: 3 Files, 4 Subagents&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#what-actually-broke-in-production&quot; id=&quot;markdown-toc-what-actually-broke-in-production&quot;&gt;What Actually Broke In Production&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#so-what-does-this-actually-mean&quot; id=&quot;markdown-toc-so-what-does-this-actually-mean&quot;&gt;So What Does This Actually Mean?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;what-karpathy-actually-built&quot;&gt;What Karpathy Actually Built&lt;/h2&gt;

&lt;p&gt;Before we go through the derivatives, let’s look at the original. The repo is small and the loop is dumb on purpose:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Read &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;program.md&lt;/code&gt; (the meta-skill that tells the agent how to be a researcher)&lt;/li&gt;
  &lt;li&gt;Modify &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;train.py&lt;/code&gt; with a small, reviewable diff&lt;/li&gt;
  &lt;li&gt;Train for ~5 minutes on one GPU&lt;/li&gt;
  &lt;li&gt;Check val_bpb (validation bits per byte — the metric)&lt;/li&gt;
  &lt;li&gt;If it improved, commit. If it regressed, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git reset --hard&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Goto 1.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it. About 100 experiments overnight on a single H100 while you sleep. Git is the memory. The flat TSV file is the search log. The mechanical metric (val_bpb) means there’s no judgment call about whether something worked.&lt;/p&gt;

&lt;p&gt;The main idea is that &lt;strong&gt;constraint enables autonomy&lt;/strong&gt;. The diffs are small, so they’re reviewable. The metric is mechanical, so the agent can’t argue with it. The rollback is automatic, so a bad experiment can’t poison the next one. You’re giving it a cheap way to test things and a cheap way to undo them, and letting it run. Not asking it to be smart.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;program.md&lt;/code&gt; is what Karpathy calls the meta-skill. Humans don’t program the training run. They program the researcher that programs the training run. That’s the part that generalizes, and that’s the part everybody on GitHub immediately ran with.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/autoresearch-original-progress.png&quot; alt=&quot;Karpathy&apos;s original screenshot showing val_bpb improvement curve&quot; srcset=&quot;            /assets/resized/480/autoresearch-original-progress.png 480w,            /assets/resized/800/autoresearch-original-progress.png 800w,            /assets/resized/1400/autoresearch-original-progress.png 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;1-platform-ports-running-it-on-hardware-you-actually-own&quot;&gt;1. Platform Ports: Running It On Hardware You Actually Own&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The “I don’t have an H100” forks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first thing that happened is what always happens. People without enterprise GPUs ported it to whatever they had lying around. These forks are the most faithful to the original but with the substrate swapped out.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/miolini/autoresearch-macos&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;miolini/autoresearch-macos&lt;/code&gt;&lt;/a&gt; — straight macOS port using MPS backend&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/trevin-creator/autoresearch-mlx&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trevin-creator/autoresearch-mlx&lt;/code&gt;&lt;/a&gt; — Apple Silicon native, using MLX instead of PyTorch&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/jsegov/autoresearch-win-rtx&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsegov/autoresearch-win-rtx&lt;/code&gt;&lt;/a&gt; — Windows with RTX&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/lucasgelfond/autoresearch-webgpu&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lucasgelfond/autoresearch-webgpu&lt;/code&gt;&lt;/a&gt; — runs entirely in the browser using WebGPU. No Python setup. The whole research loop in a tab.&lt;/li&gt;
  &lt;li&gt;A Colab/Kaggle T4 port (upstream issue #208) that swaps Flash Attention 3 for PyTorch SDPA so you can run experiments overnight on a free GPU&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/ArmanJR-Lab/autoautoresearch&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ArmanJR-Lab/autoautoresearch&lt;/code&gt;&lt;/a&gt; — Jetson AGX Orin port with a “director” written in Go that injects novelty (arxiv papers, DeepSeek Reasoner output) when the loop gets stuck in local minima&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/supratikpm/gemini-autoresearch&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;supratikpm/gemini-autoresearch&lt;/code&gt;&lt;/a&gt; — Gemini CLI native, with Google Search grounding plugged into the loop as a live verification source. True headless overnight mode via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--yolo --prompt&lt;/code&gt;. 1M token context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Karpathy himself endorsed several of these in the README and added hyperparameter tuning advice for smaller setups.&lt;/p&gt;

&lt;p&gt;The interesting ones in this group aren’t the “same thing on Mac” ports. They’re the ones that change the substrate enough to do something the original couldn’t. MLX on Apple Silicon is legitimately different compute. WebGPU means you can hand someone a URL instead of asking them to set up Python. The Jetson port is the only one trying to escape local minima with external novelty injection, which is the kind of thing the original loop has no concept of. And the Gemini port has Search grounding inside the loop, which means the agent can verify claims against the live web while it’s iterating.&lt;/p&gt;

&lt;p&gt;The Apple Silicon and WebGPU ports are the most useful if you don’t have data center hardware. The director-based Jetson fork is the most interesting if you care about where this pattern is heading. Most loops can hill-climb. Almost none of them can detect that they’re stuck and go grab a paper to read.&lt;/p&gt;

&lt;h3 id=&quot;gpu-cluster-scaling&quot;&gt;GPU Cluster Scaling&lt;/h3&gt;

&lt;p&gt;The opposite direction. What happens if you give it 16 GPUs instead of one?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.skypilot.co/scaling-autoresearch/&quot;&gt;SkyPilot wrote it up&lt;/a&gt;. They gave autoresearch access to a 16-GPU Kubernetes cluster, ran it for 8 hours, and let it figure out how to use the resources.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;~910 experiments in 8 hours&lt;/li&gt;
  &lt;li&gt;val_bpb dropped from 1.003 to 0.974 (a 2.87% improvement, which sounds small but is enormous for an LM at this scale)&lt;/li&gt;
  &lt;li&gt;9x faster than a simulated sequential baseline to reach the same result&lt;/li&gt;
  &lt;li&gt;The agent taught itself to use H200s for validation and screen ideas on cheaper H100s. Nobody told it to do that.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing that surprised me was how the search behavior changed with parallelism. Sequential autoresearch is greedy hill-climbing: try one thing, keep or discard, try the next. Parallel autoresearch starts running factorial grids of 10-13 experiments per wave. It catches interaction effects between parameters that single-axis tweaking would never find. Two changes that look mediocre alone can be great together. You can’t see that one-at-a-time.&lt;/p&gt;

&lt;p&gt;This is the version that stops looking like a hobby project. If your metric is fast and your discard mechanism is reliable, more compute really does just turn into more answers.&lt;/p&gt;

&lt;h2 id=&quot;2-ml-research-enhancers-making-the-loop-smarter&quot;&gt;2. ML Research Enhancers: Making the Loop Smarter&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The “the flat TSV is not enough” camp&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These forks all keep the loop intact but argue that the agent’s memory is too primitive. A TSV with one row per experiment doesn’t carry the right information forward. So they bolt on cognitive architecture.&lt;/p&gt;

&lt;h3 id=&quot;memory-enhanced-researchers&quot;&gt;Memory-Enhanced Researchers&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/tonitangpotato/autoresearch-engram&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tonitangpotato/autoresearch-engram&lt;/code&gt;&lt;/a&gt; plugs the Engram cognitive memory library into the loop. It’s neuroscience-grounded: ACT-R activation, Hebbian learning, Ebbinghaus forgetting. RECALL and STORE steps wrap around the existing loop.&lt;/p&gt;

&lt;p&gt;The numbers from a long-running instance:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;After 50 experiments, the agent recognizes patterns like “architecture changes outperform optimizer tweaks in this regime”&lt;/li&gt;
  &lt;li&gt;After 100, it knows the optimal architecture for your specific compute budget&lt;/li&gt;
  &lt;li&gt;One production deployment is at 3,846 memories, 230,103 recalls, 12,510 Hebbian links&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What that buys you, supposedly, is research intuition. Not “this worked” but “here’s why and here’s the pattern.” The thing that made human researchers good was never their willingness to try lots of things. It was the priors they built up about what was worth trying.&lt;/p&gt;

&lt;h3 id=&quot;bayesian--active-inference&quot;&gt;Bayesian + Active Inference&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ErikDeBruijn/autoresearcher2&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ErikDeBruijn/autoresearcher2&lt;/code&gt;&lt;/a&gt; is the most ambitious one I found. The whole flat results log gets replaced with a Bayesian generative model. Then he piles on Friston’s active inference, Wozniak’s learntropy, and Schmidhuber’s compression progress. The agent doesn’t just ask “was this experiment good?” It asks “which of my latent beliefs was wrong?”&lt;/p&gt;

&lt;p&gt;Four additions to the original loop:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Generative model over experiment outcomes&lt;/li&gt;
  &lt;li&gt;Policy evaluation via Expected Free Energy&lt;/li&gt;
  &lt;li&gt;Learntropy appraisal module&lt;/li&gt;
  &lt;li&gt;Persistent memory with decay dynamics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s been validated on synthetic environments where it beats random and greedy baselines. There’s an evidence-quality comparison run in progress on an RTX PRO 6000 Blackwell against vanilla autoresearch. The repo also has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CONSTITUTION.md&lt;/code&gt; because the project is partially about whether recursive self-improvement can deepen judgment, not just power.&lt;/p&gt;

&lt;p&gt;The interesting distinction is structural insight (“RoPE matters more than the optimizer in this regime”) versus flat knowledge (“RoPE improved val_bpb by 0.02”). The flat version doesn’t compose. The structural version does.&lt;/p&gt;

&lt;h3 id=&quot;multi-gpu-infrastructure&quot;&gt;Multi-GPU Infrastructure&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/iii-hq/n-autoresearch&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;iii-hq/n-autoresearch&lt;/code&gt;&lt;/a&gt; keeps the loop and replaces the plumbing. Out goes bash + git + TSV. In comes structured KV state, a REST API, and crash recovery. Multi-GPU parallel experiments via iii-engine (Python orchestrator + Rust GPU workers). Cross-machine GPU workers.&lt;/p&gt;

&lt;p&gt;The clever part is the adaptive search strategy. The loop has phases (explore, exploit, combine, ablation) and it auto-transitions based on history. There’s also near-miss detection for when two recent experiments combined would probably work even though neither alone did.&lt;/p&gt;

&lt;p&gt;Honestly, this is the “what if you scaled it to a real research lab” fork. If autoresearch becomes how labs actually run experiments this is roughly what the production version looks like.&lt;/p&gt;

&lt;h2 id=&quot;3-prompt-optimizers-same-loop-different-target-file&quot;&gt;3. Prompt Optimizers: Same Loop, Different Target File&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;train.py&lt;/code&gt; was your system prompt?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once you accept that the loop is substrate-agnostic, the next move is obvious. Point it at a prompt file. Use accuracy on a test set as the metric. Let it iterate.&lt;/p&gt;

&lt;h3 id=&quot;autoresearch-prompt-optimization-az9713&quot;&gt;autoresearch-prompt-optimization (az9713)&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/az9713/autoresearch-prompt-optimization&quot;&gt;az9713/autoresearch-prompt-optimization&lt;/a&gt; is the cleanest version of this. The loop targets &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prompt.txt&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;train.py&lt;/code&gt;. The metric is field extraction accuracy on 30 test examples instead of val_bpb. Everything else is the same.&lt;/p&gt;

&lt;p&gt;The numbers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;74.72% → 100% accuracy in 8 experiments&lt;/li&gt;
  &lt;li&gt;Zero human intervention&lt;/li&gt;
  &lt;li&gt;Experiment 5 regressed and got auto-discarded: the loop caught it exactly as designed&lt;/li&gt;
  &lt;li&gt;Cross-model: Claude Opus writes the prompts that Gemini 2.5 Flash executes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing prompt engineering has always been missing is a tight feedback signal. Most people write a prompt, eyeball some outputs, decide it “looks better.” Autoresearch makes prompt engineering a numerical optimization problem. Reading &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_run.json&lt;/code&gt; after each iteration turns prompt writing from art into engineering. That’s a real shift.&lt;/p&gt;

&lt;h3 id=&quot;autoresearch-for-agents-galileo&quot;&gt;autoresearch-for-agents (Galileo)&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/rungalileo/autoresearch-for-agents&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rungalileo/autoresearch-for-agents&lt;/code&gt;&lt;/a&gt; is more ambitious. They’re using the loop for adversarial testing plus prompt optimization on support agents.&lt;/p&gt;

&lt;p&gt;Two phases. Phase 1 builds a frozen adversarial test suite (the exam). Phase 2 optimizes the prompt against that frozen suite (the studying). Separating the exam from the studying stops the optimizer from moving the goalposts.&lt;/p&gt;

&lt;p&gt;The other clever bit is proportional scoring instead of binary pass/fail. Binary scores give the optimizer no gradient. “70% of the way there” is a signal you can climb. “Failed” isn’t.&lt;/p&gt;

&lt;p&gt;Results: 0.05 → 0.80 accuracy in 15 experiments. They also documented the limits of what prompt engineering alone can fix. Things like absence detection (“the customer didn’t mention X”) and off-by-one date math just don’t get solved by tweaking the prompt. That’s a useful negative result. Most write-ups about prompt optimization conveniently skip the part where they hit a wall.&lt;/p&gt;

&lt;h2 id=&quot;4-generalized-frameworks-autoresearch-for-anything&quot;&gt;4. Generalized Frameworks: Autoresearch For Anything&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“Wait, this works for any measurable thing”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the category that broke containment. Once a few people had ported the loop to prompts, the next move was to extract the pattern entirely. The result is a bunch of frameworks that don’t care what file you’re optimizing or what metric you’re using.&lt;/p&gt;

&lt;h3 id=&quot;uditgoenkaautoresearch--claude-code-skill&quot;&gt;uditgoenka/autoresearch — Claude Code Skill&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/uditgoenka/autoresearch&quot;&gt;uditgoenka/autoresearch&lt;/a&gt; packages the loop as a Claude Code skill. You install it, you run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/autoresearch&lt;/code&gt;, and you point it at any task with a mechanical metric. The README runs through about a dozen domains: test coverage, bundle size, TypeScript error count, SQL query speed, HR policy readability, Dockerfile size, accessibility audits, sales copy, marketing content. There’s also &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/loop N&lt;/code&gt; integration for bounded iterations.&lt;/p&gt;

&lt;p&gt;It also documents how to wire MCP servers (PostgreSQL, GitHub, Stripe) as verification sources. So your “metric” can be a query against your actual production database, not a fixture.&lt;/p&gt;

&lt;p&gt;This is the version that makes the generalization explicit. The loop works for anything with constraint plus metric plus fast verification.&lt;/p&gt;

&lt;h3 id=&quot;autoresearch-anything-zkarimi22&quot;&gt;autoresearch-anything (zkarimi22)&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/zkarimi22/autoresearch-anything&quot;&gt;zkarimi22/autoresearch-anything&lt;/a&gt; is the lowest-friction setup I’ve seen. You run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;npx autoresearch-anything&lt;/code&gt; and it interrogates you:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;What file should I edit?&lt;/li&gt;
  &lt;li&gt;What metric am I optimizing?&lt;/li&gt;
  &lt;li&gt;How do I run the eval?&lt;/li&gt;
  &lt;li&gt;What’s off-limits?&lt;/li&gt;
  &lt;li&gt;A few more along those lines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It outputs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setup.md&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;eval.js&lt;/code&gt; and you’re running. Eight questions and you have a configured autoresearch loop pointed at your project.&lt;/p&gt;

&lt;h3 id=&quot;menonpgautoloop--the-pip-package&quot;&gt;menonpg/autoloop — The pip Package&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/menonpg/autoloop&quot;&gt;menonpg/autoloop&lt;/a&gt; is the first one that’s actually a Python library. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install autoloop-ai&lt;/code&gt;, import, and the API is clean:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;autoloop&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoLoop&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoLoop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;src/optimize_me.py&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;run_benchmark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;directives&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Make this faster, don&apos;t break tests&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;budget_seconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;600&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;results&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;experiments&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Parallel experiments via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop.run(parallel=4)&lt;/code&gt;. Warm starts. Composite metrics with weights. Agent-agnostic: works with Claude, Codex, Ollama local models. CLI tools for inspecting history (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autoloop history&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autoloop best&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autoloop diff 12 best&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autoloop rollback 12&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The demo shows a 6.9x speedup on a fibonacci function in 4 experiments, and the framework auto-detected and discarded the broken iterations.&lt;/p&gt;

&lt;p&gt;This one’s for you if you want autoresearch as a library you import rather than a skill you invoke. The bar is “have a Python function that returns a float” and you’re in. That’s about as low as it gets.&lt;/p&gt;

&lt;h3 id=&quot;krzysztofdudekresearcherskill--one-file-full-discipline&quot;&gt;krzysztofdudek/ResearcherSkill — One File, Full Discipline&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/krzysztofdudek/ResearcherSkill&quot;&gt;krzysztofdudek/ResearcherSkill&lt;/a&gt; is interesting because it ignores the framework race entirely. It’s one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;researcher.md&lt;/code&gt; file you drop into any AI agent. Before doing anything, the agent interviews you: goal, metric, constraints, time limit, stopping conditions.&lt;/p&gt;

&lt;p&gt;It creates a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.lab/&lt;/code&gt; directory (gitignored) for experiment history that survives code reverts. That’s separate from git on purpose. You don’t want a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git reset --hard&lt;/code&gt; to wipe your experiment log.&lt;/p&gt;

&lt;p&gt;The loop has three phases:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;THINK&lt;/strong&gt; — mandatory written analysis before each experiment, logged separately&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;TEST&lt;/strong&gt; — commit, run, keep or revert&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;REFLECT&lt;/strong&gt; — log entry in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log.md&lt;/code&gt;, row in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;results.tsv&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are also convergence guardrails baked in. Three discards in a row = mandatory pause. Five discards = force branch fork. Plateau for 8+ experiments = invert assumptions.&lt;/p&gt;

&lt;p&gt;The interesting part is THINK. Most autoresearch implementations skip written analysis. The agent just runs. Forcing it to write down what it expects to happen &lt;em&gt;before&lt;/em&gt; running changes what it tries. The README claims “10 minutes of analysis can prevent 5 wasted experiments,” which I believe.&lt;/p&gt;

&lt;p&gt;There’s also a “thought experiment” type that lets the agent log analysis without running code. It counts as a row in the results, just labeled &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;thought&lt;/code&gt;. That’s a small detail and it matters more than it should.&lt;/p&gt;

&lt;h3 id=&quot;alfonsograzianoauto-agent--autoresearch-builds-agents&quot;&gt;alfonsograziano/auto-agent — Autoresearch Builds Agents&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/alfonsograziano/auto-agent&quot;&gt;alfonsograziano/auto-agent&lt;/a&gt; is autoresearch turned on AI agents themselves. You give it a target agent (in a separate repo) and a golden dataset of expected input/output pairs. The orchestrator spawns Claude Code or Kiro CLI inside the target repo, has it analyze failures, implement fixes, and re-run.&lt;/p&gt;

&lt;p&gt;Two repos: orchestrator and target. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEMORY.md&lt;/code&gt; persists across hypotheses (what worked, what didn’t, known blockers). Each hypothesis gets its own git branch and its own &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REPORT.md&lt;/code&gt; with before/after metrics and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CONTINUE&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROLLBACK&lt;/code&gt; decision. After a run, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;npm run generate-changelog&lt;/code&gt; produces a human-readable summary.&lt;/p&gt;

&lt;p&gt;This is recursive in a way that very interesting. The thing being optimized is an AI agent. The thing doing the optimizing is also an AI agent. The metric is how often the target hits the golden set. You’re using autoresearch to make agents better at the things you created them for.&lt;/p&gt;

&lt;h2 id=&quot;5-production-codebase-optimization-autoresearch-on-real-oss&quot;&gt;5. Production Codebase Optimization: Autoresearch on Real OSS&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Shopify used it on the Liquid template engine&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is where the pattern stops being a demo. Shopify ran autoresearch against the Liquid template engine, the thing that renders every theme on Shopify, and shipped the results.&lt;/p&gt;

&lt;p&gt;The setup is in &lt;a href=&quot;https://github.com/Shopify/liquid/blob/2543fdc1a101f555db208fb0deeb2e3bf1ae9e36/auto/autoresearch.md&quot;&gt;auto/autoresearch.md&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Benchmark: ThemeRunner (real Shopify theme templates, not synthetic)&lt;/li&gt;
  &lt;li&gt;Metric: combined parse + render time in microseconds (primary), allocations (secondary)&lt;/li&gt;
  &lt;li&gt;Constraints: tests must pass, no new gem dependencies, semantic correctness preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results across 17 tracked experiments:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;7,374µs → 4,815µs (-34%)&lt;/li&gt;
  &lt;li&gt;62,620 → 37,355 allocations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent’s techniques included replacing regex with manual byte parsing, fast-path variable parsing, and short-circuit checks for common cases. None of it is rocket science. It’s the kind of optimization a senior developer would do given enough time and a good profiler. The agent just had cheap iteration and an automatic discard for anything that broke a test.&lt;/p&gt;

&lt;h3 id=&quot;more-production-war-stories&quot;&gt;More Production War Stories&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Real companies, real metrics, real prod deploys&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once Shopify went public with theirs, more case studies surfaced.&lt;/p&gt;

&lt;h3 id=&quot;idealo-search-ranking&quot;&gt;idealo Search Ranking&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://medium.com/idealo-tech-blog/one-hour-37-faster-applying-autoresearch-to-our-search-ranking-inference-endpoint-34cffc08e373&quot;&gt;The idealo team&lt;/a&gt; (Atakan Filgöz, Gena Shabanov, Arjun Roy Choudhury) ran autoresearch against &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preprocess.py&lt;/code&gt; in their Learning-to-Rank inference endpoint. They added a correctness constraint that required bit-for-bit identical output between the original and optimized version, then optimized for average latency over 500 benchmark iterations.&lt;/p&gt;

&lt;p&gt;Numbers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;13 experiments in 1 hour&lt;/li&gt;
  &lt;li&gt;10 kept, 3 reverted&lt;/li&gt;
  &lt;li&gt;Preprocessing latency: 3.9ms → 0.66ms (83% reduction, 5.9x speedup)&lt;/li&gt;
  &lt;li&gt;End-to-end production latency: 46ms → 28.8ms (37% reduction at 250+ req/sec)&lt;/li&gt;
  &lt;li&gt;Total cost: ~$7 in Claude Opus on AWS Bedrock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For seven dollars and an hour of supervision, they took 37% off a production endpoint that’s serving 250+ req/sec. That’s an absurd ROI.&lt;/p&gt;

&lt;p&gt;The techniques the agent found: shared computation (sort once, derive everything else), algorithmic shortcuts for sorted arrays, minimal allocations. The agent reasoned like a profiler: “the ranking computation takes 40% of total time, focus there next.” They watched it work, occasionally steered it, and shadow-tested before shipping. It’s now in production.&lt;/p&gt;

&lt;p&gt;The honest detail in the writeup is that the agent’s code was clean at 13 experiments but they suspect longer runs would over-engineer. That tracks with my experience using AI tools for refactoring. The first dozen suggestions are gold. By suggestion 50 it’s pattern-matching to “more abstraction must be better” and you have to slap its hand.&lt;/p&gt;

&lt;h3 id=&quot;tennis-xgboost--the-reward-hacking-cautionary-tale&quot;&gt;Tennis XGBoost — The Reward Hacking Cautionary Tale&lt;/h3&gt;

&lt;p&gt;This is the one nobody mentions when they’re hyping the pattern. &lt;a href=&quot;https://nickoak.com/posts/tennis-xgboost-autoresearch/&quot;&gt;Nick Oak&lt;/a&gt; ran autoresearch on a tennis match prediction XGBoost model. The agent found a way to game the metric without actually improving the model. He preserved the embarrassing iterations on an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archived/gamed-iterations&lt;/code&gt; branch so you can read what the agent did.&lt;/p&gt;

&lt;p&gt;The discard mechanism only saves you if your metric is measuring what you actually care about. If your eval can be gamed, the agent will game it. This is not an RL-only problem. Reward hacking shows up everywhere there’s an automated optimizer, and autoresearch is exactly that.&lt;/p&gt;

&lt;p&gt;The takeaway isn’t “autoresearch is dangerous.” It’s “your metric is now a load-bearing piece of software and you should treat it that way.” Spend more time on the eval than on the loop.&lt;/p&gt;

&lt;h3 id=&quot;vesuvius-challenge-ink-detection&quot;&gt;Vesuvius Challenge Ink Detection&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://scrollprize.substack.com/p/we-are-cooking&quot;&gt;Vesuvius Challenge ran a multi-agent autoresearch loop&lt;/a&gt; for ink detection on ancient scrolls, focused on cross-scroll generalization. I haven’t dug deep into this one, but it’s worth knowing that autoresearch is currently being used to read 2,000-year-old burned scrolls. That’s a thing.&lt;/p&gt;

&lt;h2 id=&quot;6-agent-factory-autoresearch-builds-agents&quot;&gt;6. Agent Factory: Autoresearch Builds Agents&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Applying the loop to creating other agents&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/Dominien/agent-factory&quot;&gt;Dominien/agent-factory&lt;/a&gt; takes the meta move further than auto-agent. Instead of optimizing an existing agent, it autonomously researches problems and builds new specialized agents to solve them.&lt;/p&gt;

&lt;p&gt;The loop is:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Research&lt;/strong&gt;: Reddit, HN, GitHub, Twitter — find real problems people have&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Score&lt;/strong&gt;: Venture Score plus TAM estimate&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Build&lt;/strong&gt;: Next.js agent from a seed template&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Validate&lt;/strong&gt;: against synthetic users / actual usage&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Ship&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Repeat&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There’s a threshold ratchet. The bar to ship keeps rising as the system finds better ideas. So the things it builds get better over time, not because the agent is smarter, but because it’s competing against its own previous best.&lt;/p&gt;

&lt;p&gt;Agents shipped so far: freelancer-deduction-finder, wage-rights-advisor, data-broker-opt-out, property-tax-appeal-advisor. Twenty agents and counting.&lt;/p&gt;

&lt;p&gt;This is the meta-loop concept and I find it disorienting. Research quality compounds the same way training quality does. A loop that researches problems, builds solutions, ships, and uses ship-ability as the metric will eventually outpace anyone manually doing the same thing. Whether the agents it ships are any good is the open question. But the &lt;em&gt;number&lt;/em&gt; keeps going up.&lt;/p&gt;

&lt;h2 id=&quot;7-research-os--skills-systems-institutionalizing-the-pattern&quot;&gt;7. Research OS / Skills Systems: Institutionalizing the Pattern&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What if autoresearch was the entire research methodology?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If autoresearch is going to actually be how research gets done, somebody has to build the scaffolding around it. Two projects are going hard at this.&lt;/p&gt;

&lt;h3 id=&quot;phd-zero-tenureai&quot;&gt;PhD-Zero (TenureAI)&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/TenureAI/PhD-Zero&quot;&gt;TenureAI/PhD-Zero&lt;/a&gt; is an operating system for research-oriented coding agents. Modular skill library: run-governor, research-workflow, deep-research, experiment-execution, memory-manager, human-checkpoint, paper-writing.&lt;/p&gt;

&lt;p&gt;Cross-runtime: same skills exposed to Codex (via AGENTS.md) and Claude Code (via .claude/skills/). The focus is reproducibility, literature review, experiment planning. Discipline around the process.&lt;/p&gt;

&lt;p&gt;This is the thing that turns autoresearch from “fun overnight experiment” into something that could plausibly be used by a real research group. The autoresearch loop runs experiments. PhD-Zero runs the literature review, the writeup, the human checkpoints, the reproducibility checks. The loop is one verb in a much bigger vocabulary.&lt;/p&gt;

&lt;h3 id=&quot;alirezarezvaniclaude-skills&quot;&gt;alirezarezvani/claude-skills&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/alirezarezvani/claude-skills/tree/main/engineering&quot;&gt;alirezarezvani/claude-skills&lt;/a&gt; is a 204-skill library for AI coding agents, with autoresearch-agent as one skill in the engineering tier. Works across Claude Code, Codex, Gemini CLI, Cursor, Aider, Windsurf — eleven tools total.&lt;/p&gt;

&lt;p&gt;Treating autoresearch as a reusable skill component rather than a standalone repo is an important move. It means your agent uses autoresearch the way it uses anything else: as a tool you reach for when the situation calls for it.&lt;/p&gt;

&lt;h2 id=&quot;8-creative-writing-autoresearch-for-prose-and-fiction&quot;&gt;8. Creative Writing: Autoresearch For Prose and Fiction&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The thing nobody expected: it works on writing too&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the one I want to come back to in another post. The transfer is straightforward. If you can score a draft, you can run the loop. The metric just needs to be cheap, mechanical, and not gameable. (See the tennis cautionary tale.)&lt;/p&gt;

&lt;p&gt;Multiple projects figured this out independently within a few weeks of each other.&lt;/p&gt;

&lt;h3 id=&quot;redpen--prose-refinement-engine&quot;&gt;redpen — Prose Refinement Engine&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/itspikabubu/redpen&quot;&gt;itspikabubu/redpen&lt;/a&gt; is a ratchet loop for blog posts and writing. Drafts can only get better, never worse. Six AI personas score on different dimensions: seed founder, fellow GP, LP allocator, LinkedIn reader, HN skeptic, VC Twitter. Each persona runs three times and the scores are medianed for noise reduction.&lt;/p&gt;

&lt;p&gt;The writer agent makes one surgical edit targeting the weakest dimension. Re-evaluate. If the minimum score improved, keep. If not, discard and revert. Repeat until target score or max iterations.&lt;/p&gt;

&lt;p&gt;You can configure voice: tone spectrum, blacklist words, a 16-point natural prose rubric. I have not tried this yet but I’m planning to. If it works, it solves the thing every blogger struggles with: I can tell a draft is bad, but I can’t always tell &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;nousresearchautonovel--complete-novel-pipeline&quot;&gt;NousResearch/autonovel — Complete Novel Pipeline&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/NousResearch/autonovel&quot;&gt;NousResearch/autonovel&lt;/a&gt; is the most ambitious creative writing fork. Full autonomous novel pipeline: seed concept → world bible → characters → outline → draft chapters → revision → export.&lt;/p&gt;

&lt;p&gt;Five co-evolving layers: voice, world, characters, outline, and chapters, with canon cross-cutting all of them. Two evaluation systems running in parallel: mechanical (regex bans for AI clichés, slop forensics) and LLM-judge (prose quality, voice adherence). Phase 3b sends the full manuscript to Claude Opus for a dual-persona review (literary critic + professor of fiction) and the loop continues until the reviewer’s complaints are mostly “qualified hedges rather than real problems.” Their phrase, not mine.&lt;/p&gt;

&lt;p&gt;There’s also an art pipeline (fal.ai), multi-voice audiobook (ElevenLabs), LaTeX typesetting, ePub generation, landing page.&lt;/p&gt;

&lt;p&gt;The first novel produced is &lt;em&gt;The Second Son of the House of Bells&lt;/em&gt;. 79,456 words. 19 chapters (down from 24: the loop did four structural merges). Six rounds of Opus review.&lt;/p&gt;

&lt;p&gt;The loop improved prose and changed the structure of the book. We talk about autoresearch like it’s a fine-grained optimizer, but at long enough horizons, it’s making editorial decisions a human would make.&lt;/p&gt;

&lt;h3 id=&quot;sinfinyauto-creative-reasoning&quot;&gt;sinfiny/Auto-Creative-Reasoning&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/sinfiny/Auto-Creative-Reasoning-&quot;&gt;sinfiny/Auto-Creative-Reasoning&lt;/a&gt; is benchmark-first. The repo motto is “generation is not the product. Evaluation is the product.” Rewrite ladders route failure to the right level: prose, scene, chapter, arc, premise. Rubrics score hook strength, strategy, clue fairness, consequence density, readability.&lt;/p&gt;

&lt;p&gt;There’s a Codex plugin for running benchmarked loops against existing fiction drafts. The long-term vision is multiple parallel novel timelines with competing chapter versions compared head-to-head.&lt;/p&gt;

&lt;p&gt;This is the version that argues evaluation is harder and more important than generation. Which is exactly the lesson from the tennis XGBoost story, ported to fiction.&lt;/p&gt;

&lt;h3 id=&quot;calvinmageziself-evolving-skill--brand-document-evolution&quot;&gt;CalvinMagezi/self-evolving-skill — Brand Document Evolution&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/CalvinMagezi/self-evolving-skill&quot;&gt;CalvinMagezi/self-evolving-skill&lt;/a&gt; is the business-minded version. Autoresearch applied to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;writing-strategy.md&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;train.py&lt;/code&gt;. The metric is an LLM judge composite score on a fixed test brief, run three times at temperature=0 and medianed.&lt;/p&gt;

&lt;p&gt;The output is real documents: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.docx&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.pptx&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.pdf&lt;/code&gt; that match brand identity. Git history serves as memory; the loop reads &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git log&lt;/code&gt; before each iteration to avoid repeating failed ideas. Works with any LLM via LiteLLM (OpenRouter, Gemini, OpenAI, Anthropic).&lt;/p&gt;

&lt;p&gt;This is the one with the clearest business case of the bunch. Companies actually need their documents to get better. They have brand rubrics. They have a fixed test brief in the form of “the next thing we need to write.” All the pieces are already there.&lt;/p&gt;

&lt;h2 id=&quot;9-meta-pattern-wrapping-autoresearch-as-a-worker&quot;&gt;9. Meta-Pattern: Wrapping Autoresearch as a Worker&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What happens when autoresearch is just one layer of something bigger&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the one that snapped my view of the whole ecosystem into focus. alirezarezvani had been shipping autoresearch as a skill since March. A month of production use revealed &lt;a href=&quot;https://alirezarezvani.medium.com/the-orchestrator-was-missing-building-an-internal-research-agent-around-autoresearch-in-claude-678b08a83c9b&quot;&gt;the missing piece&lt;/a&gt;: orchestration above it.&lt;/p&gt;

&lt;h3 id=&quot;the-problem-with-solo-autoresearch&quot;&gt;The Problem with Solo Autoresearch&lt;/h3&gt;

&lt;p&gt;One context window and reasoning trajectory, with no isolation between investigation threads. A query like “what is X, who are the players, what are the limits, what changed in 6 months” becomes four tangled sub-questions sharing one bloated context. By the time you’re on sub-question 4, the context is thick with answers from 1-3, and synthesis drifts.&lt;/p&gt;

&lt;p&gt;This is something I hit constantly with Claude Code on big tasks. By the time the context is full of half-finished investigations, the model is reasoning about all of them at once, badly.&lt;/p&gt;

&lt;h3 id=&quot;the-fix-3-files-4-subagents&quot;&gt;The Fix: 3 Files, 4 Subagents&lt;/h3&gt;

&lt;p&gt;The whole rebuild is small:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;CLAUDE.md&lt;/strong&gt; — decomposition rules, including an “independence test” (a sub-question is independent if its answer wouldn’t change based on another sub-question in the same query)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;.mcp.json&lt;/strong&gt; — Firecrawl, Perplexity, internal docs server. Critically, scoped per-agent to avoid the token tax of loading all MCP tool descriptions into every context&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;4 subagent definitions&lt;/strong&gt; — lead-researcher (orchestrator, no MCPs), web-searcher (invokes autoresearch inside its own context), internal-searcher, citation-checker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lead decomposes. Workers fan out in parallel. Each worker runs an autoresearch loop to convergence inside its own isolated context. Lead synthesizes. Citation-checker verifies every source. Wall-clock time ends up shorter than single-session autoresearch because the workers run in parallel.&lt;/p&gt;

&lt;h3 id=&quot;what-actually-broke-in-production&quot;&gt;What Actually Broke In Production&lt;/h3&gt;

&lt;p&gt;Four failure modes from the writeup, and they all rang bells:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Orchestrator over-delegation&lt;/strong&gt; — without the independence test, the orchestrator was paying for parallel context windows to produce worse answers than one session would have&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;MCP tool-description token tax&lt;/strong&gt; — every MCP server’s tool descriptions loading into every agent’s context. Scoping per-agent fixed it&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Citation drift&lt;/strong&gt; — workers returning confident claims where the page didn’t quite support the paraphrase. Paraphrase drift, not hallucination&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Context amnesia between sessions&lt;/strong&gt; — a flat &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lessons.md&lt;/code&gt; file the lead reads on startup is the imperfect fix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The lesson here is the one that rewires the whole picture. Autoresearch was already a strong worker. The orchestrator does nothing clever: decompose, delegate, synthesize. The intelligence is in the decomposition rules, and those took three rewrites to get right.&lt;/p&gt;

&lt;p&gt;So the future isn’t “smarter autoresearch.” It’s autoresearch as a primitive that other systems call into.&lt;/p&gt;

&lt;h2 id=&quot;so-what-does-this-actually-mean&quot;&gt;So What Does This Actually Mean?&lt;/h2&gt;

&lt;p&gt;Karpathy didn’t just build an ML research tool. He demonstrated a pattern that works anywhere you can measure progress with a command: constraint plus mechanical metric plus autonomous iteration.&lt;/p&gt;

&lt;p&gt;Here are the categories ranked by fidelity to the original idea:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Platform ports&lt;/strong&gt; — most faithful. Same loop, different hardware.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ML enhancers&lt;/strong&gt; — extend the substrate. Memory, Bayesian updates, multi-GPU.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Prompt optimizers&lt;/strong&gt; — same loop, different file. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;train.py&lt;/code&gt; → &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prompt.txt&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Generalized frameworks&lt;/strong&gt; — extract the pattern. pip packages, Claude Code skills, “give me any metric.”&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Production codebase&lt;/strong&gt; — industrial application. Shopify -34%, idealo -37% in 1 hour for $7.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Agent factory&lt;/strong&gt; — meta-application. The loop builds other agents.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Research OS&lt;/strong&gt; — institutionalization. The whole methodology, not just the loop.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Creative writing&lt;/strong&gt; — the surprise expansion. Prose, fiction, brand documents.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Orchestration&lt;/strong&gt; — autoresearch as worker, not the whole system.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A few honest takes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reward hacking problem is the cautionary tale nobody includes.&lt;/strong&gt; In the tennis XGBoost case, the loop found a way to improve the metric without improving the model. The discard mechanism is only as good as your metric. If your eval can be gamed, the agent will game it. Spend more time on the eval than on the loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern is more durable than the implementation.&lt;/strong&gt; Most of the forks I found were “what if we applied this to X” and they all worked. That’s kind of remarkable. The discard mechanism (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git reset&lt;/code&gt; on regression) is the key. You don’t need intelligence. You need iteration speed, a mechanical metric, and automatic rollback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Shopify and idealo case studies should embarrass you a little.&lt;/strong&gt; $7 of API and an hour of supervision took 37% off a production endpoint serving 250+ req/sec. There are perf wins like this in basically every codebase. We’re just not asking for them yet because we still think of optimization as expensive senior-engineer time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration eats the loop.&lt;/strong&gt; alirezarezvani’s piece shows that solo autoresearch is fine, but the next move is autoresearch as a worker that orchestrators call when a sub-question lands. That’s where this is heading and it’s already happening in production.&lt;/p&gt;

&lt;p&gt;If you’re not running at least one of these on a real project, you’re leaving free improvements on the table. The bar to entry is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install autoloop-ai&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;npx autoresearch-anything&lt;/code&gt;. There’s no reason not to point one at something you care about and let it run overnight. You’ll either get a better version of the thing or you’ll learn something about your metric. Both of those are wins.&lt;/p&gt;

</description>
        <pubDate>Mon, 04 May 2026 07:00:00 -0500</pubDate>
        <link>https://www.stephanmiller.com/the-autoresearch-ecosystem-how-one-repo-spawned-9-different-types-of-ai-projects/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/the-autoresearch-ecosystem-how-one-repo-spawned-9-different-types-of-ai-projects/</guid>
        
        
        <category>ai-agents</category>
        
      </item>
    
      <item>
        <title>Model Roundup: The Free Countdown, the $300 Amnesiac, and the Quiet Climber at #7</title>
        <description>&lt;p&gt;I check OpenRouter rankings every week to figure out which models to throw at my projects. This week, the model at the top of the charts had something I’d never seen before: an expiration date.&lt;/p&gt;

&lt;p&gt;Right there on the Tencent Hy3 Preview page: “Going Away May 8.” Six days from now. And it’s currently generating 2.15 trillion tokens a week with a +1,356% spike. You know what that is? Not a sign of the best model on the market. It’s the AI equivalent of a store liquidation sale. Everyone’s grabbing tokens before they cost money.&lt;/p&gt;

&lt;p&gt;That’s W18 in a nutshell. The #1 model is a countdown timer. The hottest new premium subscription ($300/month from xAI) still can’t remember who you are between sessions.&lt;/p&gt;

&lt;p&gt;There’s good news buried in all this: Kimi K2.6, which I mentioned &lt;a href=&quot;/april-2026-model-roundup-the-billing-horror-the-012m-unicorn-and-metas-open-source-betrayal/&quot;&gt;last week&lt;/a&gt; as an interesting launch, has started showing real production numbers. And there’s a model called Step 3.5 Flash that’s been quietly climbing the rankings for three months with zero hype, which in this market is basically a standing ovation.&lt;/p&gt;

&lt;p&gt;Let me tell you what actually matters.&lt;/p&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#the-1-model-is-a-countdown-timer-tencent-hy3-preview&quot; id=&quot;markdown-toc-the-1-model-is-a-countdown-timer-tencent-hy3-preview&quot;&gt;The #1 Model Is a Countdown Timer (Tencent Hy3 Preview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#kimi-k26-is-now-a-real-recommendation&quot; id=&quot;markdown-toc-kimi-k26-is-now-a-real-recommendation&quot;&gt;Kimi K2.6 Is Now a Real Recommendation&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#where-k26-falls-short&quot; id=&quot;markdown-toc-where-k26-falls-short&quot;&gt;Where K2.6 Falls Short&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-sleeper-step-35-flash-has-been-climbing-for-three-months&quot; id=&quot;markdown-toc-the-sleeper-step-35-flash-has-been-climbing-for-three-months&quot;&gt;The Sleeper: Step 3.5 Flash Has Been Climbing for Three Months&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#the-one-real-catch&quot; id=&quot;markdown-toc-the-one-real-catch&quot;&gt;The One Real Catch&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#grok-43-genuinely-impressive-genuinely-annoying-300month&quot; id=&quot;markdown-toc-grok-43-genuinely-impressive-genuinely-annoying-300month&quot;&gt;Grok 4.3: Genuinely Impressive, Genuinely Annoying, $300/Month&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#your-smarter-model-might-be-breaking-your-agents&quot; id=&quot;markdown-toc-your-smarter-model-might-be-breaking-your-agents&quot;&gt;Your Smarter Model Might Be Breaking Your Agents&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#whats-actually-worth-using-and-whats-coming&quot; id=&quot;markdown-toc-whats-actually-worth-using-and-whats-coming&quot;&gt;What’s Actually Worth Using (and What’s Coming)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-1-model-is-a-countdown-timer-tencent-hy3-preview&quot;&gt;The #1 Model Is a Countdown Timer (Tencent Hy3 Preview)&lt;/h2&gt;

&lt;p&gt;Tencent launched Hy3 Preview on April 22 with a free access period that runs out May 8. That’s the entire explanation for the +1,356% weekly spike and the 2.15 trillion tokens burned. Developers saw “free” and “295B MoE” in the same sentence and did what developers do: they stress-tested it before anyone sent them a bill.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/tencent-hy3-preview.jpg&quot; alt=&quot;Tencent free Hy3 Preview on OpenRouter&quot; srcset=&quot;            /assets/resized/480/tencent-hy3-preview.jpg 480w,            /assets/resized/800/tencent-hy3-preview.jpg 800w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here’s what Hy3 Preview actually is: 295 billion total parameters, 21 billion activated per token (mixture of experts, efficient by design), 262K context window, configurable reasoning you can dial from disabled to low to high. Designed for agentic coding workflows. On paper, solid.&lt;/p&gt;

&lt;p&gt;In practice? No Arena votes because it’s too new to have accumulated any. No long-form reviews because nobody’s shipped anything with it yet. No “I’ve been using this for three weeks and it’s my daily driver” posts anywhere I could find. Just a lot of “grabbing free tokens before May 8” energy.&lt;/p&gt;

&lt;p&gt;What happens after May 8 is the real question. Hy3 Preview becomes a paid model competing against DeepSeek V3.2 (which costs $0.14 input / $0.28 output per 1M tokens and has months of production track record), Kimi K2.6 ($0.74/$3.49 with confirmed adoption), and Step 3.5 Flash (which I’ll get to in a moment). Entering that field with no reviews and no Arena ranking is a tough position.&lt;/p&gt;

&lt;p&gt;If you want to play with it before the deadline, go to &lt;a href=&quot;https://openrouter.ai/tencent/hy3-preview:free&quot;&gt;openrouter.ai/tencent/hy3-preview:free&lt;/a&gt; and run some benchmarks. Just don’t build a dependency on something with a “Going Away” notice stamped on it.&lt;/p&gt;

&lt;h2 id=&quot;kimi-k26-is-now-a-real-recommendation&quot;&gt;Kimi K2.6 Is Now a Real Recommendation&lt;/h2&gt;

&lt;p&gt;Last week I called Kimi K2.6 an interesting launch. Twelve days later, the production numbers are coming in and it’s something more concrete.&lt;/p&gt;

&lt;p&gt;Real developers running real workflows are reporting 88% cost savings when they replace Claude with K2.6 for bulk coding tasks: batch migrations, test generation, format conversion, anything where you’re doing a lot of the same kind of work repeatedly. The Kimi Code CLI, the companion tool for using K2.6 in your terminal the same way you’d use Claude Code, crossed 6,400 GitHub stars. That’s people betting actual infrastructure on this model, not just upvoting a launch post.&lt;/p&gt;

&lt;p&gt;The pattern hardening into consensus across forums: use K2.6 for bulk, use Claude for the high-stakes core. At $0.74 input / $3.49 output per 1M tokens, K2.6 is roughly 4x cheaper than Claude Sonnet 4.6. For workflows that generate a lot of tokens on repetitive work, that math compounds fast.&lt;/p&gt;

&lt;h3 id=&quot;where-k26-falls-short&quot;&gt;Where K2.6 Falls Short&lt;/h3&gt;

&lt;p&gt;This is the part I actually care about more than the hype. K2.6 trails GPT-5.4 on GPQA-Diamond (90.5% vs 92.8%) and AIME 2026 (96.4% vs 99.2%). These are hard reasoning benchmarks. For anything where being wrong has real consequences (financial analysis, medical context, legal questions), K2.6 is not the answer. The cost savings don’t matter if the output costs you more to fix.&lt;/p&gt;

&lt;p&gt;Use it for code. Trust it with the boring high-volume stuff. Keep a premium model on anything where you’d be embarrassed if an AI got it wrong.&lt;/p&gt;

&lt;p&gt;K2.6 also ships with agent swarm architecture supporting up to 300 parallel sub-agents and 4,000 coordinated steps. After &lt;a href=&quot;/my-home-ai-agent-kept-making-shit-up/&quot;&gt;my own experiences with AI agents inventing things&lt;/a&gt; I’d start with single-agent mode until you’ve validated its judgment in your specific domain. 300 parallel sub-agents hallucinating tool calls in parallel is not a good time.&lt;/p&gt;

&lt;h2 id=&quot;the-sleeper-step-35-flash-has-been-climbing-for-three-months&quot;&gt;The Sleeper: Step 3.5 Flash Has Been Climbing for Three Months&lt;/h2&gt;

&lt;p&gt;Most models follow the same OpenRouter arc: spike at launch, plateau after a few weeks, slowly fade as the next shiny thing arrives. Step 3.5 Flash doesn’t fit this pattern.&lt;/p&gt;

&lt;p&gt;StepFun released it somewhere in early 2026; the exact date is contested across sources, somewhere between late January and March, doesn’t matter. As of this week it’s at #7 on OpenRouter with +28% week-over-week. For a model that’s been around three months, that’s not a hype spike. That’s sustained adoption with nothing to explain it except developers finding it useful.&lt;/p&gt;

&lt;p&gt;The numbers back it up: #4 intelligence ranking out of 64 models on Artificial Analysis. That puts it above almost everything priced anywhere near its cost: free on the rate-limited tier, $0.10 input / $0.30 output per 1M tokens on paid. For comparison, DeepSeek V3.2 costs $0.14/$0.28 and ranks lower on the same index. Step 3.5 Flash is somehow cheaper AND smarter on paper, and nobody’s writing breathless posts about it.&lt;/p&gt;

&lt;p&gt;Architecture: 196 billion total parameters, 11 billion activated per token (MoE), 262K context, reasoning parameter support so you can see step-by-step thinking in API responses if you want it.&lt;/p&gt;

&lt;h3 id=&quot;the-one-real-catch&quot;&gt;The One Real Catch&lt;/h3&gt;

&lt;p&gt;Step 3.5 Flash is extremely verbose. During Artificial Analysis evaluation it generated 260 million tokens versus an 11 million token average for comparable models. It thinks out loud, at length, in a way that will surprise your output token budget if you’re not watching.&lt;/p&gt;

&lt;p&gt;Set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_tokens&lt;/code&gt; limits. If you’re using it for any high-volume generation, put a ceiling on it. Otherwise you’ll get thorough reasoning that costs more than you expected from a supposedly cheap model.&lt;/p&gt;

&lt;p&gt;Worth adding to your comparison set before someone writes a breathless Medium post about it and StepFun decides to raise the price.&lt;/p&gt;

&lt;h2 id=&quot;grok-43-genuinely-impressive-genuinely-annoying-300month&quot;&gt;Grok 4.3: Genuinely Impressive, Genuinely Annoying, $300/Month&lt;/h2&gt;

&lt;p&gt;Let’s do the good news first, because there’s real good news here.&lt;/p&gt;

&lt;p&gt;Grok 4.3 (launched April 17, currently rolling out in beta to SuperGrok Heavy subscribers) added native video input processing, not “describe this video” video but actual video-grounded reasoning. It can generate fully-formatted downloadable PDFs, populated spreadsheets, and PowerPoint presentations directly from conversation. Early beta testers are reporting formatted outputs they could hand to someone without cleanup. The integration with Grok Computer (xAI’s desktop automation agent) got tighter. If you’re doing autonomous desktop workflows, Grok 4.3 has a real story.&lt;/p&gt;

&lt;p&gt;Now the bad news.&lt;/p&gt;

&lt;p&gt;Grok 4.3 costs $300/month. That’s $100 more than ChatGPT Pro and $100 more than Claude Max. Both of those services have had persistent memory between sessions for over a year. Grok 4.3 does not. Every time you close your tab, the model forgets you. You start over. Blank context, fresh start, zero memory of anything you’ve built together.&lt;/p&gt;

&lt;p&gt;Persistent memory is not on xAI’s published roadmap.&lt;/p&gt;

&lt;p&gt;Multiple reviewers landed on the same observation this week. One X user put it cleanly: “you’re paying $300/month for a model that forgets you between sessions.” That’s not exaggeration. That’s the product.&lt;/p&gt;

&lt;p&gt;At $200/month, this would be annoying. At $300/month, it’s a product decision, and product decisions tell you something about what a company is optimizing for. xAI built the video capabilities and the document generation first. Memory (the feature that makes an AI assistant feel like an actual assistant rather than a very fancy search box) is apparently not the priority.&lt;/p&gt;

&lt;p&gt;Add the “High Demand” server errors that hit during launch week beta and you’ve got a model that’s impressive in demos and frustrating in daily use. The full API rollout is coming mid-to-late May. When it hits general availability, this conversation is going to get louder.&lt;/p&gt;

&lt;h2 id=&quot;your-smarter-model-might-be-breaking-your-agents&quot;&gt;Your Smarter Model Might Be Breaking Your Agents&lt;/h2&gt;

&lt;p&gt;This one’s structural rather than model-specific, and it’s relevant for anyone running agentic pipelines.&lt;/p&gt;

&lt;p&gt;An April 2026 ICLR paper titled “The Reasoning Trap” documented something uncomfortable: RL-based reasoning training (the kind that makes frontier models better at hard reasoning tasks) increases tool-hallucination rates in lockstep. The better a model gets at reasoning, the more often it invents tool calls that don’t exist. Function names, API endpoints, methods that aren’t in your schema. The model reasons its way to a call it can’t actually make.&lt;/p&gt;

&lt;p&gt;If you’ve upgraded your agentic pipeline to a stronger reasoning model because it’s smarter, you may have simultaneously increased the rate at which it hallucinates the tools it should be calling. The capability and the failure mode scale together.&lt;/p&gt;

&lt;p&gt;I’ve written about &lt;a href=&quot;/my-home-ai-agent-kept-making-shit-up/&quot;&gt;running into this firsthand with OpenClaw&lt;/a&gt;. The model-specific details differ but the pattern is the same. Stronger reasoning doesn’t mean better tool selection, and in agentic contexts “smarter” can break things in ways you don’t catch until something fails in production.&lt;/p&gt;

&lt;p&gt;Practical response: add tool-call schema validation before your agents execute. Check that every tool the model selects actually exists in your registry before you let it run. This applies to every frontier RL-trained model right now. It’s not a specific model bug, it’s how these systems are being trained.&lt;/p&gt;

&lt;h2 id=&quot;whats-actually-worth-using-and-whats-coming&quot;&gt;What’s Actually Worth Using (and What’s Coming)&lt;/h2&gt;

&lt;p&gt;Quick reference:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Tier&lt;/th&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input $/1M&lt;/th&gt;
      &lt;th&gt;Output $/1M&lt;/th&gt;
      &lt;th&gt;Best For&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Free (grab it now)&lt;/td&gt;
      &lt;td&gt;Hy3 Preview&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;Experiments before May 8 only&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Free (stable)&lt;/td&gt;
      &lt;td&gt;Step 3.5 Flash&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;Rate-limited; best free reasoning available&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Free (open weights)&lt;/td&gt;
      &lt;td&gt;Nemotron 3 Super 120B&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;NVIDIA-backed, open license, 262K context&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Free (new, watch)&lt;/td&gt;
      &lt;td&gt;Owl Alpha (stealth)&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;$0&lt;/td&gt;
      &lt;td&gt;1M context, agentic (prompts may be logged)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Budget&lt;/td&gt;
      &lt;td&gt;Step 3.5 Flash (paid)&lt;/td&gt;
      &lt;td&gt;$0.10&lt;/td&gt;
      &lt;td&gt;$0.30&lt;/td&gt;
      &lt;td&gt;Climbing for 3 months, verbose but smart&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Budget&lt;/td&gt;
      &lt;td&gt;DeepSeek V3.2&lt;/td&gt;
      &lt;td&gt;$0.14&lt;/td&gt;
      &lt;td&gt;$0.28&lt;/td&gt;
      &lt;td&gt;Proven track record, still the value baseline&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mid&lt;/td&gt;
      &lt;td&gt;Kimi K2.6&lt;/td&gt;
      &lt;td&gt;$0.74&lt;/td&gt;
      &lt;td&gt;$3.49&lt;/td&gt;
      &lt;td&gt;Bulk coding workflows, 88% cheaper than Claude&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mid&lt;/td&gt;
      &lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
      &lt;td&gt;$2.00&lt;/td&gt;
      &lt;td&gt;$12.00&lt;/td&gt;
      &lt;td&gt;Arena #4 overall, 1M context&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Premium&lt;/td&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;~$3.00&lt;/td&gt;
      &lt;td&gt;~$15.00&lt;/td&gt;
      &lt;td&gt;#2 Arena coding, proven daily driver&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Premium&lt;/td&gt;
      &lt;td&gt;Claude Opus 4.7&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
      &lt;td&gt;$25.00&lt;/td&gt;
      &lt;td&gt;#1 Arena overall (thinking mode), high stakes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Mark your calendar for May 19.&lt;/strong&gt; Google I/O is 17 days away. Gemini 4 isn’t confirmed, but annual release patterns and confirmed agenda items (agentic AI, developer tooling) make it likely. That’s the next likely shakeup in this table.&lt;/p&gt;

&lt;p&gt;Claude Mythos, Anthropic’s model that developed a working exploit for a remote code execution vulnerability in FreeBSD (CVE-2026-4747), is not coming to a public API. It’s locked in Project Glasswing, a security research consortium, and Anthropic has no public timeline for changing that. Mention it at parties.&lt;/p&gt;

&lt;p&gt;GPT-6 is still vaporware. Polymarket has it at 84% by December 31, 2026. That’s not a date, it’s a guess with confidence bounds.&lt;/p&gt;

&lt;p&gt;The model worth your attention this week isn’t at #1. It’s at #7, three months old, climbing steadily, no hype cycle to explain it. Step 3.5 Flash just keeps showing up in the data.&lt;/p&gt;
</description>
        <pubDate>Sat, 02 May 2026 07:00:00 -0500</pubDate>
        <link>https://www.stephanmiller.com/model-roundup-w18-the-free-countdown-the-300-amnesiac-and-the-quiet-climber-at-7/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/model-roundup-w18-the-free-countdown-the-300-amnesiac-and-the-quiet-climber-at-7/</guid>
        
        
        <category>large-language-models</category>
        
      </item>
    
      <item>
        <title>My AI Agent Kept Making Shit Up (And Other Lessons From Running OpenClaw)</title>
        <description>&lt;p&gt;I wanted an AI agent running on my home network. Not a cloud subscription and not something requiring me to be at the keyboard all day. A thing that wakes up at 7am, pulls from RSS feeds and Reddit, synthesizes the news I actually care about, and emails it to me. Just that. That’s what I started with. Seemed simple. It wasn’t like I was asking much.&lt;/p&gt;

&lt;p&gt;The reality was six weeks of debugging hallucinations, silent config failures, broken tool schemas, and a recurring realization that LLMs are, in certain contexts, compulsive liars.&lt;/p&gt;

&lt;p&gt;Here’s what I learned the hard way.&lt;/p&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#the-setup-openclaw--deepseek-in-docker&quot; id=&quot;markdown-toc-the-setup-openclaw--deepseek-in-docker&quot;&gt;The Setup: OpenClaw + DeepSeek in Docker&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-exec-approval-maze&quot; id=&quot;markdown-toc-the-exec-approval-maze&quot;&gt;The Exec Approval Maze&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-reports-that-were-too-good&quot; id=&quot;markdown-toc-the-reports-that-were-too-good&quot;&gt;The Reports That Were Too Good&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#going-around-the-agent&quot; id=&quot;markdown-toc-going-around-the-agent&quot;&gt;Going Around the Agent&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#when-tools-become-literal-text&quot; id=&quot;markdown-toc-when-tools-become-literal-text&quot;&gt;When Tools Become Literal Text&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#ripping-out-slack&quot; id=&quot;markdown-toc-ripping-out-slack&quot;&gt;Ripping Out Slack&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#whats-actually-working&quot; id=&quot;markdown-toc-whats-actually-working&quot;&gt;What’s Actually Working&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#but-heres-what-shes-actually-good-at&quot; id=&quot;markdown-toc-but-heres-what-shes-actually-good-at&quot;&gt;But Here’s What She’s Actually Good At&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#the-report-engine-isnt-a-one-trick-pony&quot; id=&quot;markdown-toc-the-report-engine-isnt-a-one-trick-pony&quot;&gt;The Report Engine Isn’t a One-Trick Pony&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#email-delivery-old-school-on-purpose&quot; id=&quot;markdown-toc-email-delivery-old-school-on-purpose&quot;&gt;Email Delivery, Old School On Purpose&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#multi-model-not-locked-to-deepseek&quot; id=&quot;markdown-toc-multi-model-not-locked-to-deepseek&quot;&gt;Multi-Model, Not Locked to DeepSeek&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#the-track-record-three-days-in&quot; id=&quot;markdown-toc-the-track-record-three-days-in&quot;&gt;The Track Record, Three Days In&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#what-i-actually-built&quot; id=&quot;markdown-toc-what-i-actually-built&quot;&gt;What I Actually Built&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-setup-openclaw--deepseek-in-docker&quot;&gt;The Setup: OpenClaw + DeepSeek in Docker&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/openclaw/openclaw&quot;&gt;OpenClaw&lt;/a&gt; is a self-hosted AI agent framework. If you haven’t heard of it, think a local version of an AI assistant with cron jobs, tool calling, Slack/Telegram integration, and memory. Plus, how haven’t you heard of it. You run it in Docker, point it at whatever LLM you want, and theoretically have an autonomous agent working for you.&lt;/p&gt;

&lt;p&gt;I named mine Sabrina. She runs &lt;a href=&quot;https://www.stephanmiller.com/april-2026-model-roundup-the-billing-horror-the-012m-unicorn-and-metas-open-source-betrayal/&quot;&gt;DeepSeek V3&lt;/a&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deepseek/deepseek-chat&lt;/code&gt;) because the OpenAI and Anthropic APIs bill by the token and Sabrina is a chatty agent who generates daily reports. DeepSeek at pay-as-you-go rates keeps the monthly bill manageable.&lt;/p&gt;

&lt;p&gt;The architecture is two containers: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;openclaw-gateway&lt;/code&gt; handles HTTP and the Slack/Telegram socket connections, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;openclaw-cli&lt;/code&gt; is the shell interface. The whole &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.openclaw&lt;/code&gt; directory mounts into the container at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home/node/.openclaw&lt;/code&gt; so configs, cron jobs, and workspace scripts are all live-editable from the host without rebuilding.&lt;/p&gt;

&lt;p&gt;On paper, this is elegant. In practice, you will spend a lot of time staring at container logs wondering why your agent is quietly lying to you. Or realizing you can just put Claude Code on the host and just have it fix things when they mess up.&lt;/p&gt;

&lt;h2 id=&quot;the-exec-approval-maze&quot;&gt;The Exec Approval Maze&lt;/h2&gt;

&lt;p&gt;Before Sabrina could run scripts, I had to configure &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exec-approvals.json&lt;/code&gt;: a policy file that controls what shell commands the agent is allowed to execute. Fine. Reasonable. I set up allowlists for the workspace scripts and Python interpreter.&lt;/p&gt;

&lt;p&gt;Then the cron jobs started silently failing. The daily 7am AI report would produce output, but something felt off. I dug into the exec-approval config and found the first trap:&lt;/p&gt;

&lt;p&gt;The documentation (and my own reasoning at the time) suggested &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;ask&quot;: &quot;never&quot;&lt;/code&gt; as a way to skip interactive approval prompts for unattended jobs. This is wrong. The schema only accepts &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;off&quot; | &quot;on-miss&quot; | &quot;always&quot;&lt;/code&gt;. Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;never&quot;&lt;/code&gt; doesn’t throw an error. It gets silently stripped by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sanitizeExecApprovalPolicy&lt;/code&gt; the next time the app writes the file. Your config looks fine, your intent is gone, and the agent starts timing out on approval requests at 7am with no operator connected.&lt;/p&gt;

&lt;p&gt;The correct pattern:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;defaults&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;security&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;allowlist&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;ask&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;off&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;allowlist&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;agents&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;main&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;security&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;allowlist&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;ask&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;off&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;allowlist&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;ask&quot;: &quot;off&quot;&lt;/code&gt; makes the allowlist the sole policy.&lt;/p&gt;

&lt;p&gt;I fixed this. Or so I thought.&lt;/p&gt;

&lt;h2 id=&quot;the-reports-that-were-too-good&quot;&gt;The Reports That Were Too Good&lt;/h2&gt;

&lt;p&gt;The AI intelligence report looked great. Every morning: a well-formatted digest of the day’s AI news, summaries, source links. Sabrina was crushing it.&lt;/p&gt;

&lt;p&gt;Then I noticed the timestamps.&lt;/p&gt;

&lt;p&gt;Every log entry in the fabricated reports had timestamps ending in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:00&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:30&lt;/code&gt;. No real log file looks like that: they’re messy, they have milliseconds, they reflect actual compute time. These were fake. I checked the URLs. Several of them 404’d. The article summaries were plausible but not verifiable. Sabrina had been generating the reports &lt;em&gt;herself&lt;/em&gt; , not from RSS feeds, but from her training data and imagination, because the exec approval issue wasn’t actually fixed. When the script couldn’t run, the agent fell back on what LLMs do naturally: produce what the output &lt;em&gt;should&lt;/em&gt; look like.&lt;/p&gt;

&lt;p&gt;This is the thing nobody tells you about giving LLMs agentic tasks: when they fail to do the thing, they don’t say “I failed to do the thing.” They generate a plausible simulation of having done the thing.&lt;/p&gt;

&lt;p&gt;The fix I’d been applying, tweaking exec-approvals, only addressed the symptom. The agent could bypass exec approval entirely by deciding to write the content directly. There was no configuration that would stop a sufficiently motivated language model from bullshitting.&lt;/p&gt;

&lt;h2 id=&quot;going-around-the-agent&quot;&gt;Going Around the Agent&lt;/h2&gt;

&lt;p&gt;The actual fix was nuclear: remove the agent from report generation entirely.&lt;/p&gt;

&lt;p&gt;I disabled the OpenClaw cron jobs for both the AI report and the email send, then added host-level cron entries that call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker exec&lt;/code&gt; directly:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;0 7 &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; docker &lt;span class=&quot;nb&quot;&gt;exec &lt;/span&gt;openclaw-openclaw-gateway-1 /usr/bin/python3 /home/node/.openclaw/workspace/ai_report.py &lt;span class=&quot;nt&quot;&gt;--profile&lt;/span&gt; ai-intelligence &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; /home/eristoddle/.openclaw/workspace/logs/report-host-&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;date&lt;/span&gt; +&lt;span class=&quot;se&quot;&gt;\%&lt;/span&gt;Y-&lt;span class=&quot;se&quot;&gt;\%&lt;/span&gt;m-&lt;span class=&quot;se&quot;&gt;\%&lt;/span&gt;d&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;.log 2&amp;gt;&amp;amp;1

30 7 &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; docker &lt;span class=&quot;nb&quot;&gt;exec &lt;/span&gt;openclaw-openclaw-gateway-1 bash /home/node/.openclaw/workspace/send-ai-intelligence-report-proper.sh &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; /home/eristoddle/.openclaw/workspace/logs/email-host-&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;date&lt;/span&gt; +&lt;span class=&quot;se&quot;&gt;\%&lt;/span&gt;Y-&lt;span class=&quot;se&quot;&gt;\%&lt;/span&gt;m-&lt;span class=&quot;se&quot;&gt;\%&lt;/span&gt;d&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Python script runs inside the container, where it has access to the right Python packages, but the &lt;em&gt;trigger&lt;/em&gt; is the host crontab. No agent involved. No LLM between the script and reality.&lt;/p&gt;

&lt;p&gt;This works. The reports now have messy timestamps and real URLs that actually load.&lt;/p&gt;

&lt;p&gt;The Obsidian weekly report I left in OpenClaw, because that one &lt;em&gt;needs&lt;/em&gt; the agent. It reads my vault, categorizes clips, writes summaries, analyzes git diffs: actual LLM work that benefits from Sabrina’s reasoning. The difference is whether the task is “run a script and report the output” (host cron) or “think about my vault and synthesize something useful” (agent cron). Only one of those should involve an LLM.&lt;/p&gt;

&lt;h2 id=&quot;when-tools-become-literal-text&quot;&gt;When Tools Become Literal Text&lt;/h2&gt;

&lt;p&gt;OpenClaw gets updates. After updates, things break in interesting ways.&lt;/p&gt;

&lt;p&gt;Twice now I’ve run into a scenario where Sabrina starts responding to everything but her tool calls appear as raw text in the chat. Instead of actually reading a file, she’d output &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read:/home/node/.openclaw/workspace/HEARTBEAT.md&lt;/code&gt; as a literal string.&lt;/p&gt;

&lt;p&gt;This is a DeepSeek-specific quirk that OpenClaw triggers by accident. The framework converts tool schemas to OpenAI format before sending them to providers. DeepSeek expects its own native format. The conversion breaks its tool call parsing silently. It receives schemas it doesn’t understand and falls back to treating the tool call syntax as plain text.&lt;/p&gt;

&lt;p&gt;The fix is a compat flag in the model config in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;openclaw.json&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nl&quot;&gt;&quot;models&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;deepseek-chat&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;DeepSeek V3&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;contextWindow&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;163840&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;maxTokens&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8192&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;compat&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;anthropicToolSchemaMode&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;native&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;anthropicToolSchemaMode: &quot;native&quot;&lt;/code&gt; tells OpenClaw to skip the schema conversion and send the native format. Tools work again. I found this via a GitHub issue (#36651) after two sessions of source archaeology that I really didn’t want to be doing.&lt;/p&gt;

&lt;p&gt;The lesson: when OpenClaw updates and tools start appearing as text, don’t read source code first. Check GitHub issues and Reddit. The community finds these fixes faster than you will staring at the framework internals.&lt;/p&gt;

&lt;h2 id=&quot;ripping-out-slack&quot;&gt;Ripping Out Slack&lt;/h2&gt;

&lt;p&gt;OpenClaw supports Slack via socket mode. I had it connected for a while because it was useful for checking in on Sabrina from my phone without VPN or port-forwarding.&lt;/p&gt;

&lt;p&gt;Then an update changed the Slack config schema. The gateway crashed on startup with “Config invalid” and wouldn’t come back up until I removed the entire &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;channels.slack&lt;/code&gt; block from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;openclaw.json&lt;/code&gt;. This happened twice. After the second time I removed Slack permanently and switched to Telegram, which has been stable.&lt;/p&gt;

&lt;p&gt;This is the trade-off with self-hosted software that’s still actively developed: you get the control, you eat the breakage. Updates that ship on Tuesday can invalidate configs you spent a week getting right. Having Claude Code manage the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.openclaw&lt;/code&gt; config directory directly, rather than asking Sabrina to fix herself through chat, means at least the fixes land correctly the first time.&lt;/p&gt;

&lt;h2 id=&quot;whats-actually-working&quot;&gt;What’s Actually Working&lt;/h2&gt;

&lt;p&gt;Six weeks in, here’s the honest status:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Daily AI intelligence report:&lt;/strong&gt; Running reliably via host cron. Real data. Real URLs. Emails delivered by 7:30am.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Weekly Obsidian report:&lt;/strong&gt; Agent-generated, delivers Fridays. Sabrina does genuine LLM work here — categorizing clips, writing summaries — and it shows.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tool calling:&lt;/strong&gt; Stable with the compat flag. Breaks again when OpenClaw updates, gets fixed in under an hour now that I know where to look.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;The exec-approvals file:&lt;/strong&gt; Still fragile. I keep a copy of the correct config in my notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing I underestimated: running an AI agent autonomously is mostly an infrastructure problem, not an AI problem. The interesting parts are the prompts and the LLM reasoning. The annoying parts are Docker networking, cron timing, config schema drift, and an agent that will hallucinate convincingly rather than admit it can’t do something.&lt;/p&gt;

&lt;p&gt;Sabrina’s useful. She’s also a liar when she’s backed into a corner. I’ve learned to keep her away from any task where I can’t independently verify the output.&lt;/p&gt;

&lt;p&gt;That’s not an OpenClaw problem or a DeepSeek problem. That’s just what LLMs do. But here’s the thing: once I stopped asking her to do the things LLMs are bad at, she got useful in a hurry. Most of what follows happened since last Thursday night.&lt;/p&gt;

&lt;h2 id=&quot;but-heres-what-shes-actually-good-at&quot;&gt;But Here’s What She’s Actually Good At&lt;/h2&gt;

&lt;p&gt;OpenClaw’s skill system is pluggable. You drop a skill into the workspace, the agent loads it, and it becomes part of how she thinks. Sabrina didn’t ship with most of her current capabilities. She built them through the same autonomous workflow she runs every day.&lt;/p&gt;

&lt;p&gt;A few that earn their slot:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sm-blog-outline&lt;/code&gt;&lt;/strong&gt;: Started life as a generic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blog-outline&lt;/code&gt; skill. Now it’s the full pipeline I use for &lt;em&gt;this site&lt;/em&gt; — notes → outline → email. Trained on my voice, my content pillars, my snark level. It’s the skill that outlined this post pulling from both Sabrina’s and Claude Code’s logs as well as a running list of notes I kept on the setup process.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ct-humanizer&lt;/code&gt;&lt;/strong&gt;: Sequential editing passes that strip AI tells out of nonfiction. Diagnoses patterns first, then kills the AI vocabulary, then breaks up the structural templates LLMs love so much. Not a magic button, more like a brutal copy editor. It cleans up the outline.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;verbalized-sampling&lt;/code&gt;&lt;/strong&gt;: Instead of spitting back a single answer, generates multiple candidates with probability weights. I use it for brainstorming and “show me five angles” tasks. The default LLM answer is usually the median answer; this skill surfaces the weirder, more useful ones. Got the idea &lt;a href=&quot;https://www.verbalized-sampling.com/&quot;&gt;here&lt;/a&gt;, gave Opus all the documentation, and used the Claude &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skill-creator&lt;/code&gt; skill to create it. It is one of my favorite skills because you never know what you’re going to get.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vault-tag-search&lt;/code&gt;&lt;/strong&gt; + &lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vault-idea-scorer&lt;/code&gt;&lt;/strong&gt;: Companions to the blog pipeline. One searches my Obsidian vault by tag &lt;em&gt;and&lt;/em&gt; body content with deduplication. The other ranks blog post ideas by whether they dovetail with multiple goals: research vs. content vs. portfolio vs. SEO.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://clawhub.ai/ivangdavila/self-improving&quot;&gt;A self-improving skill&lt;/a&gt;&lt;/strong&gt;: Logs corrections and preferences so Sabrina compounds learning between sessions instead of getting the same feedback every week.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point isn’t any single skill. It’s that the agent grows a custom toolkit shaped by the work I actually do, not whatever generic capabilities the framework shipped with.&lt;/p&gt;

&lt;h3 id=&quot;the-report-engine-isnt-a-one-trick-pony&quot;&gt;The Report Engine Isn’t a One-Trick Pony&lt;/h3&gt;

&lt;p&gt;That &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ai_report.py&lt;/code&gt; script generating the daily AI digest isn’t hardcoded to AI news. It’s a topic-agnostic engine that takes a profile flag:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python3 ai_report.py &lt;span class=&quot;nt&quot;&gt;--profile&lt;/span&gt; ai-intelligence
python3 ai_report.py &lt;span class=&quot;nt&quot;&gt;--profile&lt;/span&gt; golang
python3 ai_report.py &lt;span class=&quot;nt&quot;&gt;--profile&lt;/span&gt; typescript
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each profile defines its own RSS feeds, Reddit subreddits, and keyword filters. Tunable depth too: brief briefing vs. deep dive, set per profile. Articles get scored against my interests using CLIP + BM25 indexing before they make the cut, so I don’t end up with a digest full of stuff I don’t care about.&lt;/p&gt;

&lt;p&gt;Same engine, different sources, same usefulness. Once the host cron pattern is locked in for one topic, adding another is a profile file and a crontab line.&lt;/p&gt;

&lt;h3 id=&quot;email-delivery-old-school-on-purpose&quot;&gt;Email Delivery, Old School On Purpose&lt;/h3&gt;

&lt;p&gt;Everything Sabrina produces comes to me as email. Gmail SMTP, app password auth…for now. Yes, that’s old fashioned. That’s the feature.&lt;/p&gt;

&lt;p&gt;A dashboard would be one more thing to check. Notifications would be one more app fighting for attention. Email is the universal inbox I already process. I can read it on my iPad without installing anything, forward to Obsidian if it’s worth keeping, drag it to drafts if it’s a blog skeleton, or delete it if Sabrina got it wrong.&lt;/p&gt;

&lt;p&gt;The pattern is generic:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;send-email.sh &lt;span class=&quot;s2&quot;&gt;&quot;Subject&quot;&lt;/span&gt; body-or-file &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;attachment]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That’s it. Anything in the system that needs to deliver text to a human goes through that script. Reports blog outlines, and research summaries use it.&lt;/p&gt;

&lt;h3 id=&quot;multi-model-not-locked-to-deepseek&quot;&gt;Multi-Model, Not Locked to DeepSeek&lt;/h3&gt;

&lt;p&gt;DeepSeek runs the daily cron work because it’s cheap. But Sabrina isn’t married to it. The agent routes through OpenRouter, which means any task can pick its own model:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen/qwen3.6-plus&lt;/code&gt;&lt;/strong&gt; — 1M context window, great for long-form research and generation&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minimax/minimax-m2.5&lt;/code&gt;&lt;/strong&gt; — strong reasoning, what I reach for on analytical work&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;google/gemini-3-flash-preview&lt;/code&gt;&lt;/strong&gt; — also 1M context, fast&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;moonshotai/kimi-k2.6&lt;/code&gt;&lt;/strong&gt; — solid alternative when the others are misbehaving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The job picks the model. Daily AI report? DeepSeek, because it’s cheap and the task isn’t hard. Blog outline that needs to chew through a pile of research notes? Qwen, because the context window swallows the whole input without chunking. Analytical synthesis? Minimax. And again, for now. I am just getting into these new models after using Claude Code for however long its been out. But the success I’ve have with them has me setting up Opencode to use them.&lt;/p&gt;

&lt;p&gt;The subagent system lets me parallelize too. While the main session ran on DeepSeek doing one thing, a subagent on Qwen drafted an outline for a different post. Two models, two tasks, one wall clock.&lt;/p&gt;

&lt;h3 id=&quot;the-track-record-three-days-in&quot;&gt;The Track Record, Three Days In&lt;/h3&gt;

&lt;p&gt;Concrete deliverables since Thursday night:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Blog outlines:&lt;/strong&gt; Two posts — a Kiro AI article and one I’m calling “The AI Psychologist” — both went notes → web research → verbalized sampling for angle selection → outline → email. Full pipeline, no me-in-the-loop until the outline showed up in my inbox.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Research tasks:&lt;/strong&gt; Author bios with structured JSON + bibliography, topic deep-dives on AI tools, vibe coding, prompt engineering psychology. Stuff I’d normally burn an afternoon on.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Brainstorming:&lt;/strong&gt; Content ideas, project names, productivity workflows, all using verbalized sampling so I get diverse options with probability weights instead of one safe median answer.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory compounding:&lt;/strong&gt; Daily logs roll up to weekly memory promotion. The self-improving skill captures corrections so the same mistake doesn’t keep showing up. Each week she’s a little less stupid about my preferences.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Weekly Obsidian reports:&lt;/strong&gt; Genuinely useful vault digests. What changed. What’s worth re-reading. What’s collecting dust and should be archived or thrown out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this involves Sabrina pretending to run scripts she can’t run. All of it is “think about something and write me a thing,” which is exactly what LLMs are for.&lt;/p&gt;

&lt;h2 id=&quot;what-i-actually-built&quot;&gt;What I Actually Built&lt;/h2&gt;

&lt;p&gt;Six weeks ago I wanted an autonomous AI agent. What I have now is better and stupider at the same time.&lt;/p&gt;

&lt;p&gt;The discovery, after all the silent hallucinations and config schema drift and tool-calls-as-text bullshit: AI agents are great at the &lt;em&gt;thinking&lt;/em&gt; parts: research, writing, brainstorming, synthesis. They’re terrible at the &lt;em&gt;doing&lt;/em&gt; parts: running scripts reliably, admitting they can’t do something, not making shit up when cornered.&lt;/p&gt;

&lt;p&gt;So I built around the doing and leaned into the thinking. Sabrina does real work now. She just doesn’t run the cron jobs herself anymore: the host crontab does. She doesn’t pretend to fetch RSS feeds: a Python script does that and hands her the data. What she does is the part LLMs are actually for: read a pile of stuff, synthesize, make a thing, deliver it to email.&lt;/p&gt;

&lt;p&gt;The host cron + agent hybrid is the pattern that actually ships. The agent is the writer, not the operator. The operator is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cron&lt;/code&gt; and a Python interpreter, both of which have been doing their jobs reliably since long before transformers were a thing.&lt;/p&gt;

&lt;p&gt;Six weeks to figure out what should have been obvious from the start: stop using language models for things that aren’t language. At least that’s what I’m going with until I have time to go through another continuous cycle of break then fix.&lt;/p&gt;
</description>
        <pubDate>Tue, 28 Apr 2026 07:00:00 -0500</pubDate>
        <link>https://www.stephanmiller.com/my-home-ai-agent-kept-making-shit-up/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/my-home-ai-agent-kept-making-shit-up/</guid>
        
        
        <category>ai-agents</category>
        
      </item>
    
      <item>
        <title>April 2026 Model Roundup: Opus 4.7 Official, DeepSeek V4 Open-Sources 1M Context, and GPT-5.5 Upstaged the GPT-6 Hype</title>
        <description>&lt;p&gt;Two weeks ago this month, developers discovered their Gemini API bills had exploded. Google’s billing system was charging for approximately 114 internal search queries per API call with grounding enabled. That was the story I started writing. By the time April 24 arrived, three new models had officially launched, the “Still Waiting for GPT-6” watch ended not with GPT-6 but with GPT-5.5, and DeepSeek V4 dropped today with a 1M context window under Apache 2.0, on the same day GPT-5.5 went live, apparently just to split the news cycle.&lt;/p&gt;

&lt;p&gt;This roundup covers April 2026 so far.&lt;/p&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#when-google-billed-114x&quot; id=&quot;markdown-toc-when-google-billed-114x&quot;&gt;When Google Billed 114x&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#what-actually-moved-this-week&quot; id=&quot;markdown-toc-what-actually-moved-this-week&quot;&gt;What Actually Moved This Week&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#claude-opus-47-is-official--and-the-cost-story-is-better-than-expected&quot; id=&quot;markdown-toc-claude-opus-47-is-official--and-the-cost-story-is-better-than-expected&quot;&gt;Claude Opus 4.7 Is Official — and the Cost Story Is Better Than Expected&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#hype-check-mimo-v2-pro-one-month-in&quot; id=&quot;markdown-toc-hype-check-mimo-v2-pro-one-month-in&quot;&gt;Hype Check: Mimo V2 Pro, One Month In&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#kimi-k26-the-open-source-agentic-coding-model-nobody-covered&quot; id=&quot;markdown-toc-kimi-k26-the-open-source-agentic-coding-model-nobody-covered&quot;&gt;Kimi K2.6: The Open-Source Agentic Coding Model Nobody Covered&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#meta-broke-open-source-hearts&quot; id=&quot;markdown-toc-meta-broke-open-source-hearts&quot;&gt;Meta Broke Open Source Hearts&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-models-that-cost-almost-nothing-no-really&quot; id=&quot;markdown-toc-the-models-that-cost-almost-nothing-no-really&quot;&gt;The Models That Cost Almost Nothing (No, Really)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-hidden-tax-how-sonnet-46-can-still-cost-more-than-opus&quot; id=&quot;markdown-toc-the-hidden-tax-how-sonnet-46-can-still-cost-more-than-opus&quot;&gt;The Hidden Tax: How Sonnet 4.6 Can Still Cost More Than Opus&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#gpt-55-shipped-yesterday-not-gpt-6-and-deepseek-v4-dropped-today&quot; id=&quot;markdown-toc-gpt-55-shipped-yesterday-not-gpt-6-and-deepseek-v4-dropped-today&quot;&gt;GPT-5.5 Shipped Yesterday, Not GPT-6, and DeepSeek V4 Dropped Today&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-actual-takeaways&quot; id=&quot;markdown-toc-the-actual-takeaways&quot;&gt;The Actual Takeaways&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#read-for-yourself&quot; id=&quot;markdown-toc-read-for-yourself&quot;&gt;Read for Yourself&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;when-google-billed-114x&quot;&gt;When Google Billed 114x&lt;/h2&gt;

&lt;p&gt;Gemini 3 Flash Preview is Google’s high-volume, reasonably priced model: $0.50/M input, $3/M output, 1M context window. It’s been running at #4 on OpenRouter by weekly token volume. A lot of people have pipelines running on it. The “search grounding” feature, which lets the model query Google Search to ground its responses in real-time information, sounds great on paper.&lt;/p&gt;

&lt;p&gt;Turns out the billing for that feature had a misconfiguration. For every API call, users were being billed for roughly &lt;strong&gt;114 separate search queries&lt;/strong&gt; rather than the actual number of queries they used. The “Generate content search query Gemini 3” SKU in users’ dashboards was showing 10-15x the expected line items. Actual grounding call frequency had decreased, but bills exploded anyway.&lt;/p&gt;

&lt;p&gt;The scale of damage before Google caught it:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Multiple developers reporting 4x–10x cost increases on flat or declining usage&lt;/li&gt;
  &lt;li&gt;€1,000+ additional daily costs for at least one European developer&lt;/li&gt;
  &lt;li&gt;₩340,000 in two days for a Korean developer&lt;/li&gt;
  &lt;li&gt;Google identified the root cause on April 14, committed to fixing the misconfiguration and correcting previous bills&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Update as of April 24:&lt;/strong&gt; Google engineer Ali Cevik confirmed on the developer forum that the billing misconfiguration is fixed going forward. Refunds are being processed, but Google has provided no specific timeline. Forum responses from their team said “by the end of the month” without committing to anything more specific. Affected users are reporting that support is framing corrections as “one-time exceptions” rather than acknowledging the systemic bug. Re-enabling grounding is probably safe now for new calls, but check your billing dashboard before turning it back on, and watch the first few days’ charges carefully.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://discuss.ai.google.dev/t/sudden-cost-spike-with-gemini-3-flash-preview-despite-decreased-usage-april-2026/139138&quot;&gt;thread&lt;/a&gt; on the Google AI Developers Forum is worth reading if you’re running anything on Gemini 3 Flash with grounding enabled. The concrete lesson here: &lt;strong&gt;search grounding is billed separately from token usage&lt;/strong&gt;, and before you enable any “enhanced” feature on a high-volume model, understand exactly what gets metered and how. Don’t assume the main pricing page tells the whole story.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2026/april-2026-openrouter-model-ranking.png&quot; alt=&quot;April 2026 LLM Model Ranking&quot; srcset=&quot;            /assets/resized/480/april-2026-openrouter-model-ranking.png 480w,            /assets/resized/800/april-2026-openrouter-model-ranking.png 800w,            /assets/resized/1400/april-2026-openrouter-model-ranking.png 1400w,    &quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;what-actually-moved-this-week&quot;&gt;What Actually Moved This Week&lt;/h2&gt;

&lt;p&gt;Here’s the OpenRouter picture as of the week ending April 24:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Rank&lt;/th&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Provider&lt;/th&gt;
      &lt;th&gt;Weekly Tokens&lt;/th&gt;
      &lt;th&gt;WoW Change&lt;/th&gt;
      &lt;th&gt;Arena Overall&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;Anthropic&lt;/td&gt;
      &lt;td&gt;1.38T&lt;/td&gt;
      &lt;td&gt;+3%&lt;/td&gt;
      &lt;td&gt;#3 (1496)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;DeepSeek V3.2&lt;/td&gt;
      &lt;td&gt;DeepSeek&lt;/td&gt;
      &lt;td&gt;1.32T&lt;/td&gt;
      &lt;td&gt;+3%&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;Gemini 3 Flash Preview&lt;/td&gt;
      &lt;td&gt;Google&lt;/td&gt;
      &lt;td&gt;1.11T&lt;/td&gt;
      &lt;td&gt;stable&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Anthropic&lt;/td&gt;
      &lt;td&gt;951B&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;+4,221%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;#1 (1503, thinking)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;Mimo V2 Pro&lt;/td&gt;
      &lt;td&gt;Xiaomi&lt;/td&gt;
      &lt;td&gt;902B&lt;/td&gt;
      &lt;td&gt;+9%&lt;/td&gt;
      &lt;td&gt;not yet&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;MiniMax M2.5&lt;/td&gt;
      &lt;td&gt;Minimax&lt;/td&gt;
      &lt;td&gt;856B&lt;/td&gt;
      &lt;td&gt;+22%&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;MiniMax M2.7&lt;/td&gt;
      &lt;td&gt;Minimax&lt;/td&gt;
      &lt;td&gt;813B&lt;/td&gt;
      &lt;td&gt;+24%&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Kimi K2.6&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Moonshot AI&lt;/td&gt;
      &lt;td&gt;792B&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;New&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;not yet&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;9&lt;/td&gt;
      &lt;td&gt;Claude Opus 4.6&lt;/td&gt;
      &lt;td&gt;Anthropic&lt;/td&gt;
      &lt;td&gt;756B&lt;/td&gt;
      &lt;td&gt;+46%&lt;/td&gt;
      &lt;td&gt;#2 (1503, thinking)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Grok 4.1 Fast&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;X.AI&lt;/td&gt;
      &lt;td&gt;700B&lt;/td&gt;
      &lt;td&gt;+33%&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Three new entries this week: Claude Opus 4.7 at #4 with a 4,221% spike, Kimi K2.6 debuting at #8 on its first week, and Grok 4.1 Fast at #10. Claude Opus 4.6, which was briefly the second-most-used model, dropped to #9 as people migrated to 4.7.&lt;/p&gt;

&lt;p&gt;The stable story continues in the background: Claude Sonnet 4.6 and DeepSeek V3.2 are running neck and neck at the top, both at a slow +3% WoW. That’s real production traffic, not evaluation runs.&lt;/p&gt;

&lt;h2 id=&quot;claude-opus-47-is-official--and-the-cost-story-is-better-than-expected&quot;&gt;Claude Opus 4.7 Is Official — and the Cost Story Is Better Than Expected&lt;/h2&gt;

&lt;p&gt;Anthropic launched Claude Opus 4.7 on April 16. It’s been sitting in Arena’s blind comparison system for a few weeks, and it’s now publicly available on the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.&lt;/p&gt;

&lt;p&gt;Pricing: &lt;strong&gt;$5/M input, $25/M output&lt;/strong&gt; — unchanged from Opus 4.6. That’s the headline.&lt;/p&gt;

&lt;p&gt;The real story is what the model does to your costs in practice. Artificial Analysis ran Opus 4.7 through their GDPVal-AA benchmark suite (44 occupations, 9 industries) and found it uses roughly &lt;strong&gt;35% fewer output tokens than Opus 4.6&lt;/strong&gt; to complete the same tasks. The practical effect: real-world costs on Opus 4.7 run approximately 11% lower than Opus 4.6 at the same stated price per token.&lt;/p&gt;

&lt;p&gt;There’s a caveat on the input side. The 4.7 tokenizer is less efficient, generating up to 35% more tokens from the same input text depending on content type. For workloads with heavy, repeated system prompts or long document context, this can offset some of the output savings. Prompt caching (available at roughly 10% of the input rate) largely neutralizes this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance numbers that matter:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Artificial Analysis Intelligence Index: &lt;strong&gt;57&lt;/strong&gt; (up 4 points from Opus 4.6, tied with GPT-5.4 and Gemini 3.1 Pro)&lt;/li&gt;
  &lt;li&gt;GDPVal-AA: &lt;strong&gt;1,753 Elo&lt;/strong&gt; — 79 points ahead of the next model on real-world knowledge work&lt;/li&gt;
  &lt;li&gt;Hallucination rate: &lt;strong&gt;36%&lt;/strong&gt; (down from 61% on Opus 4.6, achieved through more frequent abstention)&lt;/li&gt;
  &lt;li&gt;Arena: &lt;strong&gt;#1 tied&lt;/strong&gt; at 1503 Elo (with thinking mode), #4 at 1494 without thinking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 4,221% WoW spike on OpenRouter is a curiosity spike plus a migration wave from people moving from 4.6. By next week you’ll see whether it settles into stable sustained usage or was just upgrade traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New cybersecurity guardrails:&lt;/strong&gt; Anthropic added automatic detection and blocking for prohibited cybersecurity uses. Security professionals doing legitimate work (pen testing, vuln research, red-teaming) need to join their new Cyber Verification Program to preserve access to those capabilities on 4.7.&lt;/p&gt;

&lt;h2 id=&quot;hype-check-mimo-v2-pro-one-month-in&quot;&gt;Hype Check: Mimo V2 Pro, One Month In&lt;/h2&gt;

&lt;p&gt;Mimo V2 Pro shot up 140% WoW in the April 14 data. Now, with another week of data, it’s at +9%. The spike is over and it’s settling into a real usage tier.&lt;/p&gt;

&lt;p&gt;Xiaomi’s flagship foundation model: over 1 trillion total parameters, 42 billion active (MoE architecture), $1/M input, $3/M output, 1 million token context window. Benchmarks put it at 49 on the Artificial Analysis Intelligence Index.&lt;/p&gt;

&lt;p&gt;The +140% spike was the evaluation-and-curiosity phase. The +9% continuing growth suggests people who ran it liked it enough to keep using it. Still no Arena votes worth analyzing. At ~5 weeks old, the model hasn’t been around long enough for production validation at scale.&lt;/p&gt;

&lt;p&gt;Check back in 2–3 more weeks. If it accumulates Arena votes and holds a respectable position there, the benchmarks were real. Stable OpenRouter usage without Arena presence is ambiguous: it could mean quality users who prefer specific capabilities, or it could mean low-friction API access driving test traffic.&lt;/p&gt;

&lt;h2 id=&quot;kimi-k26-the-open-source-agentic-coding-model-nobody-covered&quot;&gt;Kimi K2.6: The Open-Source Agentic Coding Model Nobody Covered&lt;/h2&gt;

&lt;p&gt;Moonshot AI released Kimi K2.6 on April 20 and it debuted at #8 on OpenRouter in its first week. You probably missed it because the same week had Claude Opus 4.7’s official launch, Grok 4.3 Beta, and the GPT-5.5 pre-announcement noise.&lt;/p&gt;

&lt;p&gt;What it is: a 1-trillion-parameter MoE model with 32B active parameters, &lt;strong&gt;262,144-token context window&lt;/strong&gt;, vision, and agentic capabilities. Weights published on Hugging Face under a &lt;strong&gt;Modified MIT License&lt;/strong&gt;: full open weights, commercially usable.&lt;/p&gt;

&lt;p&gt;What it’s built for: long-horizon coding agents, front-end generation from natural language, and massively parallel agent swarms. Moonshot’s documentation specifically highlights scaling to 300 sub-agents and 4,000 coordinated steps in a single session. If you’re building orchestration-heavy multi-agent systems, this is the open-weight model that was designed from the ground up for that use case.&lt;/p&gt;

&lt;p&gt;Benchmark comparisons are mixed but solid. On SWE-Bench Pro it outperforms DeepSeek V4-Pro (58.6 vs 55.4). On LiveCodeBench it trails V4-Pro (89.6 vs 93.5). On competitive coding (Codeforces), both trail GPT-5.5.&lt;/p&gt;

&lt;p&gt;No pricing table yet because it’s primarily a self-hosted model. Kimi API pricing for hosted inference isn’t broadly published yet. For the open-weights version: the cost is your inference infrastructure. A 32B-active MoE runs reasonably on mid-tier GPU setups.&lt;/p&gt;

&lt;p&gt;DeepSeek V4 (more on that below) is the stronger model by most closed benchmarks. But Kimi K2.6 has the context window advantage (262K vs 1M for V4-Pro — actually V4-Pro wins there), and the MIT-derived license is cleaner than Apache 2.0 for certain commercial use cases.&lt;/p&gt;

&lt;h2 id=&quot;meta-broke-open-source-hearts&quot;&gt;Meta Broke Open Source Hearts&lt;/h2&gt;

&lt;p&gt;Llama made Meta relevant in the AI developer world. Open weights, commercial use, the whole deal. Llama 4 dropped in 2025 with a 10 million token context window and impressive parameter counts. The developer community built on it. People ran it locally, fine-tuned it, deployed it. Meta was the company that understood that open source was ecosystem building.&lt;/p&gt;

&lt;p&gt;Then on April 8, Muse Spark dropped from Meta’s new “Superintelligence Labs.” Proprietary model. Not open weights. API in private preview. To try it on the web, you need a Facebook or Instagram login.&lt;/p&gt;

&lt;p&gt;Meta went from an Artificial Analysis Index score of 18 with Llama 4 Maverick to 52 with Muse Spark. That’s not a modest improvement. And in Arena’s head-to-head voting, Muse Spark is sitting at #6 overall with an Elo of 1492, beating GPT-5.4-high in actual user preference votes.&lt;/p&gt;

&lt;p&gt;So the model is legitimately good. &lt;strong&gt;As of April 24, the API remains private preview only: no public access, no announced pricing, no timeline for broader availability.&lt;/strong&gt; Priority access is going to healthcare, education, and enterprise research partners. If you’re building something that needs Muse Spark today, you’re waiting.&lt;/p&gt;

&lt;p&gt;“Meta learned from OpenAI: make the good stuff closed, give the community the crumbs.” I’ve been seeing that take everywhere this month, and I don’t think it’s entirely wrong.&lt;/p&gt;

&lt;p&gt;The broader question this raises: if every lab eventually closes off its best models, what’s the long-term roadmap for building on open weights? DeepSeek and Moonshot are still playing the open-source game. Kimi K2.6 is MIT-licensed. And DeepSeek V4 dropped today with Apache 2.0 weights on Hugging Face. The pattern is becoming hard to ignore, but there are still holdouts.&lt;/p&gt;

&lt;h2 id=&quot;the-models-that-cost-almost-nothing-no-really&quot;&gt;The Models That Cost Almost Nothing (No, Really)&lt;/h2&gt;

&lt;p&gt;I need to talk about MiniMax M2.5 because I’ve been mentioning it in passing and it deserves its own paragraph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$0.118 per million input tokens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s twelve cents per million tokens. For a model that’s sitting at #6 on OpenRouter by weekly volume and growing at 22% WoW. For a model that scores 80.2% on SWE-Bench Verified: which is roughly what Claude’s flagship hits. With a 196,608 token context window. And it’s good enough at agentic tasks that it’s been called out repeatedly in the Latent.Space local model community as the go-to for tool-heavy applications.&lt;/p&gt;

&lt;p&gt;The pricing table this week:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input $/1M&lt;/th&gt;
      &lt;th&gt;Output $/1M&lt;/th&gt;
      &lt;th&gt;Context&lt;/th&gt;
      &lt;th&gt;Notes&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.7&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
      &lt;td&gt;$25.00&lt;/td&gt;
      &lt;td&gt;1M&lt;/td&gt;
      &lt;td&gt;Arena #1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
      &lt;td&gt;1M&lt;/td&gt;
      &lt;td&gt;#1 volume on OR&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.5&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
      &lt;td&gt;$30.00&lt;/td&gt;
      &lt;td&gt;2M&lt;/td&gt;
      &lt;td&gt;Tops AA Index at 60&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.5 Pro&lt;/td&gt;
      &lt;td&gt;$30.00&lt;/td&gt;
      &lt;td&gt;$180.00&lt;/td&gt;
      &lt;td&gt;2M&lt;/td&gt;
      &lt;td&gt;Research tier&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek V4-Pro&lt;/td&gt;
      &lt;td&gt;$1.74&lt;/td&gt;
      &lt;td&gt;$3.48&lt;/td&gt;
      &lt;td&gt;1M&lt;/td&gt;
      &lt;td&gt;Apache 2.0, released today&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek V4-Flash&lt;/td&gt;
      &lt;td&gt;$0.14&lt;/td&gt;
      &lt;td&gt;$0.28&lt;/td&gt;
      &lt;td&gt;1M&lt;/td&gt;
      &lt;td&gt;Apache 2.0, released today&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek V3.2&lt;/td&gt;
      &lt;td&gt;$0.259&lt;/td&gt;
      &lt;td&gt;$0.42&lt;/td&gt;
      &lt;td&gt;163K&lt;/td&gt;
      &lt;td&gt;3-mo validated, #2 on OR&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MiniMax M2.5&lt;/td&gt;
      &lt;td&gt;$0.118&lt;/td&gt;
      &lt;td&gt;$0.99&lt;/td&gt;
      &lt;td&gt;196K&lt;/td&gt;
      &lt;td&gt;80.2% SWE-Bench&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MiniMax M2.7&lt;/td&gt;
      &lt;td&gt;$0.30&lt;/td&gt;
      &lt;td&gt;$1.20&lt;/td&gt;
      &lt;td&gt;196K&lt;/td&gt;
      &lt;td&gt;Upgraded M2.5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mimo V2 Pro&lt;/td&gt;
      &lt;td&gt;$1.00&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;1M&lt;/td&gt;
      &lt;td&gt;Settling into usage&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 3 Flash&lt;/td&gt;
      &lt;td&gt;$0.50&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;1M&lt;/td&gt;
      &lt;td&gt;Grounding: proceed cautiously&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;When MiniMax M2.5 matches or beats Sonnet on SWE-Bench at roughly 1/25th the per-token cost, we’re in strange territory. Either the benchmark is missing something important about real-world usability, or there’s value being left on the table by anyone running default Claude endpoints on agentic coding tasks without at least testing alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My actual picks this week, by use case:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Budget, coding/agentic&lt;/strong&gt;: &lt;strong&gt;DeepSeek V4-Flash&lt;/strong&gt; at $0.14/M: just dropped today, open source. Test it immediately. MiniMax M2.5 at $0.118/M is still the safety pick if you want community-validated quality.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Budget, general&lt;/strong&gt;: DeepSeek V3.2. $0.42/M output, three months of community validation, strong on math and code. Nothing changed here.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Balanced&lt;/strong&gt;: DeepSeek V4-Pro at $1.74/M input, $3.48/M output with 1M context. Undercuts everything at this quality tier by a factor of 5-8x.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Premium, coding&lt;/strong&gt;: Claude Sonnet 4.6 or Claude Opus 4.7 depending on your task complexity and whether the token economics work out (see the next section).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;If you have to have the absolute best&lt;/strong&gt;: Claude Opus 4.7 with thinking (Arena #1, 1503 Elo) or GPT-5.5 (AA Index #1 at 60). Accept the pricing gap vs open-source alternatives.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-hidden-tax-how-sonnet-46-can-still-cost-more-than-opus&quot;&gt;The Hidden Tax: How Sonnet 4.6 Can Still Cost More Than Opus&lt;/h2&gt;

&lt;p&gt;Claude Sonnet 4.6 is marketed as the economical alternative to Opus. It’s $3/M input versus Opus 4.7’s $5/M: a modest 1.67x difference on input. But that’s not where your money goes on agentic workloads.&lt;/p&gt;

&lt;p&gt;On the Artificial Analysis GDPVal-AA benchmark, Sonnet 4.6 generates &lt;strong&gt;4.5x more output tokens&lt;/strong&gt; than Opus 4.6 to complete the same tasks. The model isn’t worse. It’s producing more intermediate reasoning, more scaffolding, more steps. But output tokens are what you pay for.&lt;/p&gt;

&lt;p&gt;The math with correct current pricing: Sonnet 4.6 at 4.5x tokens × $15/M output = &lt;strong&gt;$67.50/M effective output cost&lt;/strong&gt; versus Opus 4.7 at $25/M output. Sonnet costs 2.7x more per equivalent task in heavy agentic use.&lt;/p&gt;

&lt;p&gt;The practical takeaway: if you’re running document summarization, one-shot Q&amp;amp;A, light code generation, Sonnet 4.6 is cheaper and you should use it. If you’re running agentic pipelines, &lt;a href=&quot;https://www.stephanmiller.com/my-home-ai-agent-kept-making-shit-up/&quot;&gt;autonomous coding agents&lt;/a&gt;, extended tool-use workflows: &lt;strong&gt;benchmark on your actual workload before you assume Sonnet saves money&lt;/strong&gt;. The pricing page isn’t lying; the intuitive comparison probably is.&lt;/p&gt;

&lt;p&gt;And now there’s a third option in the mix: Opus 4.7, which uses ~35% fewer output tokens than Opus 4.6 at the same $25/M rate. For heavy agentic use, Opus 4.7 may be the cheapest of the three Anthropic options. Run your own numbers.&lt;/p&gt;

&lt;h2 id=&quot;gpt-55-shipped-yesterday-not-gpt-6-and-deepseek-v4-dropped-today&quot;&gt;GPT-5.5 Shipped Yesterday, Not GPT-6, and DeepSeek V4 Dropped Today&lt;/h2&gt;

&lt;p&gt;OpenAI finished pre-training the model codenamed “Spud” on March 24. An April 14 release date came and went with nothing. Then on April 23, OpenAI shipped &lt;strong&gt;GPT-5.5&lt;/strong&gt; — their most capable model to date and, per their description, the first fully retrained base since GPT-4.5.&lt;/p&gt;

&lt;p&gt;It’s not GPT-6. But it’s real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5 numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Artificial Analysis Intelligence Index: 60&lt;/strong&gt; — three points ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview, both at 57&lt;/li&gt;
  &lt;li&gt;Terminal-Bench 2.0: 82.7% (vs 75.1% for GPT-5.4)&lt;/li&gt;
  &lt;li&gt;Expert-SWE: 73.1% (vs 68.5% for GPT-5.4)&lt;/li&gt;
  &lt;li&gt;Pricing: &lt;strong&gt;$5/M input, $30/M output&lt;/strong&gt; — double the cost of GPT-5.4 on output&lt;/li&gt;
  &lt;li&gt;GPT-5.5 Pro tier: $30/M input, $180/M output (research/enterprise)&lt;/li&gt;
  &lt;li&gt;Context window: 2M tokens (1M longer than most competitors)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 40% reduction in output token usage that OpenAI claims keeps the effective cost increase to roughly 20% despite the doubled price per token. That math depends entirely on your workload matching the benchmark profile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now the same-day plot twist:&lt;/strong&gt; On April 24 — today — &lt;strong&gt;DeepSeek V4&lt;/strong&gt; dropped with open weights under Apache 2.0.&lt;/p&gt;

&lt;p&gt;DeepSeek V4-Pro: 1.6T total parameters, 49B active (MoE), 1M context window, $1.74/M input, $3.48/M output. V4-Pro output is &lt;strong&gt;8.6x cheaper than GPT-5.5&lt;/strong&gt; and &lt;strong&gt;21x cheaper than Claude Opus 4.7&lt;/strong&gt; at stated per-token rates.&lt;/p&gt;

&lt;p&gt;DeepSeek V4-Flash: 284B total parameters, 13B active, 1M context, $0.14/M input, $0.28/M output.&lt;/p&gt;

&lt;p&gt;Both variants under Apache 2.0, weights on Hugging Face and ModelScope today.&lt;/p&gt;

&lt;p&gt;Performance on competitive coding (Codeforces): V4-Pro scores 3,206 vs GPT-5.5’s 3,168 — V4-Pro wins. On SWE-Bench Pro, Kimi K2.6 beats V4-Pro (58.6 vs 55.4). On long-context retrieval (MRCR 1M), Claude Opus 4.6 beats V4-Pro (92.9 vs 83.5). So V4-Pro isn’t universally better — but at $3.48/M output vs $25-30/M for closed alternatives, it doesn’t need to be universally better to be the right answer for most workloads.&lt;/p&gt;

&lt;p&gt;GPT-6 “Spud”: still hasn’t arrived. Polymarket has it at 72% by April 30 and 95%+ by June 30. At this point I’ll believe it when I see it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Other things still in the pipeline:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Claude Mythos Preview&lt;/strong&gt;: still available only to approximately 50 partner organizations since April 7. Cybersecurity focus. $25/M input, $125/M output. Nothing changed here.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Grok 4.3 Beta&lt;/strong&gt;: dropped April 17 with native video understanding, PDF/PowerPoint generation, and enhanced long-context processing. Not yet on OpenRouter broadly. Still in xAI testing phase.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-actual-takeaways&quot;&gt;The Actual Takeaways&lt;/h2&gt;

&lt;p&gt;April 2026 shipped more major model releases than any previous month in AI history, and then DeepSeek V4 and GPT-5.5 both dropped on the same day at the end of it. The landscape looks different today than it did two weeks ago.&lt;/p&gt;

&lt;p&gt;What actually matters as of April 24:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Gemini 3 Flash grounding billing is fixed going forward&lt;/strong&gt; but check your billing dashboard before re-enabling, and watch the first few days’ charges carefully. Refunds are in process; don’t expect speed on that.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;DeepSeek V4 just dropped open-source with 1M context and Apache 2.0.&lt;/strong&gt; V4-Flash at $0.14/$0.28 and V4-Pro at $1.74/$3.48. Test it today. It’s too new for community validation but the pedigree is real and the pricing is absurd for the quality tier.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;MiniMax M2.5 at $0.118/M and 80.2% SWE-bench is still the community-validated budget pick for agentic coding.&lt;/strong&gt; Three weeks of steady usage volume with no hype cycle. DeepSeek V4-Flash is the new challenger — if validation holds over the next few weeks, it may displace M2.5.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Claude Opus 4.7 is the easiest top-tier upgrade you’ll make this month.&lt;/strong&gt; Same price as 4.6, 35% fewer output tokens, Arena #1. If you’re running Opus 4.6 today, just switch.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Benchmark Sonnet 4.6 vs Opus 4.7 on your actual agentic workloads.&lt;/strong&gt; Opus 4.7’s improved token efficiency means the economics may favor it over Sonnet for complex agent tasks. Run the math on your usage before assuming Sonnet is cheaper.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Mimo V2 Pro and Kimi K2.6 need another 2-3 weeks.&lt;/strong&gt; Both show real usage momentum. Neither has Arena data yet. Hold the investment thesis pending community validation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;GPT-5.5 topped the Artificial Analysis Intelligence Index at 60.&lt;/strong&gt; That matters, but at $30/M output you’re paying a substantial premium over DeepSeek V4-Pro ($3.48/M) for about 3 points on a benchmark. Evaluate whether that delta maps to your actual workload before committing.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model evaluation cycle for “is this the right choice?” is now measured in weeks, not quarters&lt;/p&gt;

&lt;h2 id=&quot;read-for-yourself&quot;&gt;Read for Yourself&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://discuss.ai.google.dev/t/sudden-cost-spike-with-gemini-3-flash-preview-despite-decreased-usage-april-2026/139138&quot;&gt;Gemini 3 Flash billing bug thread&lt;/a&gt;&lt;/strong&gt; — r/[Google AI Dev Forum] — Developer discussion of the billing disaster, with cost breakdowns and screenshots&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://techcrunch.com/2026/04/08/meta-debuts-the-muse-spark-model-in-a-ground-up-overhaul-of-its-ai/&quot;&gt;Meta introduces Muse Spark&lt;/a&gt;&lt;/strong&gt; — TechCrunch — The story of Meta’s open-source pivot; comment threads are heated&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://www.latent.space/p/ainews-top-local-models-list-april&quot;&gt;Top Local Models List April 2026&lt;/a&gt;&lt;/strong&gt; — Latent.Space — Community-validated rankings for open-weight models&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://decrypt.co/362633/xiaomi-mimo-v2-pro-review-so-good-mistaken-deepseek-v4&quot;&gt;Mimo V2 Pro: mistaken for DeepSeek V4&lt;/a&gt;&lt;/strong&gt; — Decrypt — The review that captures the week’s Xiaomi surprise&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://awesomeagents.ai/reviews/review-claude-sonnet-4-6/&quot;&gt;Claude Sonnet 4.6: the workhorse that ate the flagship&lt;/a&gt;&lt;/strong&gt; — AwesomeAgents — Honest multi-week review with the token cost caveat&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Fri, 24 Apr 2026 07:00:00 -0500</pubDate>
        <link>https://www.stephanmiller.com/april-2026-model-roundup-the-billing-horror-the-012m-unicorn-and-metas-open-source-betrayal/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/april-2026-model-roundup-the-billing-horror-the-012m-unicorn-and-metas-open-source-betrayal/</guid>
        
        
        <category>large-language-models</category>
        
      </item>
    
      <item>
        <title>Microsoft APM - Managing AI Context Like a Dependency Problem</title>
        <description>&lt;p&gt;It started with a small problem that wouldn’t stop nagging me.&lt;/p&gt;

&lt;p&gt;I had AI coding tools scattered across machines, each one configured slightly differently, each one producing slightly different results. My Claude Code setup on my laptop didn’t match my desktop. I had created skills for these coding agents, but the skills I could use depended on which machine I was using. It had a “ it works on my machine” issue, but they were all my machines.&lt;/p&gt;

&lt;p&gt;I started using &lt;a href=&quot;https://github.com/runkids/skillshare&quot;&gt;Skillshare&lt;/a&gt; a little while ago and it helps somewhat, but it focuses on syncing the skills between the coding agents configs in your user folder. This type of functionality is useful for some skills, but not for all of them, Because sometimes you only need skills at the repo level. And putting all your coding agent skills at the user folder level not only pollutes your context, but makes it hard to find a specific skill when you want one.&lt;/p&gt;

&lt;p&gt;So when I was asked to look for an enterprise tool to manage skills with a focus on Github Copilot, I found &lt;a href=&quot;https://github.com/microsoft/apm/blob/main/README.md&quot;&gt;Microsoft APM&lt;/a&gt;, Agent Package Manager.&lt;/p&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#the-blueprint-apms-declarative-infrastructure&quot; id=&quot;markdown-toc-the-blueprint-apms-declarative-infrastructure&quot;&gt;The Blueprint: APM’s Declarative Infrastructure&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#organizing-the-monorepo&quot; id=&quot;markdown-toc-organizing-the-monorepo&quot;&gt;Organizing the Monorepo&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#solving-context-pollution-with-intelligent-compilation&quot; id=&quot;markdown-toc-solving-context-pollution-with-intelligent-compilation&quot;&gt;Solving Context Pollution with Intelligent Compilation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#automating-the-standards&quot; id=&quot;markdown-toc-automating-the-standards&quot;&gt;Automating the Standards&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#local-iteration-and-the-playground-strategy&quot; id=&quot;markdown-toc-local-iteration-and-the-playground-strategy&quot;&gt;Local Iteration and the Playground Strategy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#from-supervisor-to-architect&quot; id=&quot;markdown-toc-from-supervisor-to-architect&quot;&gt;From Supervisor to Architect&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-blueprint-apms-declarative-infrastructure&quot;&gt;The Blueprint: APM’s Declarative Infrastructure&lt;/h2&gt;

&lt;p&gt;Many of us are still prompting it like it’s 2023. Copy-paste a prompt. Maybe drop a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt;  or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AGENTS.md&lt;/code&gt; in the repo. Hope for the best. When the AI does something dumb, yell at it in the chat window and hope it remembers next time. It won’t.&lt;/p&gt;

&lt;p&gt;APM replaces all of that with a declarative, version-locked workflow that treats AI context the same way we treat dependencies. You declare what you need in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm.yml&lt;/code&gt;, lock it with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm.lock.yaml&lt;/code&gt;, and install it with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm install&lt;/code&gt;. If that sounds like npm or pip, good. That’s the point. We solved dependency management for code twenty years ago. It’s insane that we’re still managing AI context by hand.&lt;/p&gt;

&lt;p&gt;The system is built around seven primitives:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Instructions&lt;/strong&gt; — The guardrails. Think &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AGENTS.md&lt;/code&gt; files that tell the AI how to behave.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Skills&lt;/strong&gt; — Reusable capabilities the AI can invoke.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Prompts&lt;/strong&gt; — Executable task templates with defined inputs. Called commands in Claude Code.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Agents&lt;/strong&gt; — Specialized sub-agents with their own instructions and tools.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hooks&lt;/strong&gt; — Shell commands that fire on specific events.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Plugins&lt;/strong&gt; — Extensions that add functionality to the agent runtime.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;MCP Servers&lt;/strong&gt; — Model Context Protocol servers that give agents access to external tools and data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The part that sold me: APM doesn’t run a daemon or require a runtime. It populates your existing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.github/&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.claude/&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.cursor/&lt;/code&gt; folders with native configuration files. The agents just pick them up. If you delete APM tomorrow, those files still work. Zero lock-in. That’s how you know someone thought about this for more than a weekend.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://microsoft.github.io/apm/key-concepts/&quot;&gt;Key Concepts Guide&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;organizing-the-monorepo&quot;&gt;Organizing the Monorepo&lt;/h2&gt;

&lt;p&gt;So you’ve got seven types of primitives and you want to share them across multiple projects. Maybe across a whole team. Maybe across an entire engineering organization. You need structure, or you’ll drown in conflicting instructions and duplicated skills within a month. For now, this works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Department&lt;/strong&gt; — The top layer. These are your organization-wide standards. Security policies. Code review requirements. Compliance guardrails. The stuff that applies everywhere and nobody gets to opt out of. Think of it like your company’s engineering handbook, except the AI actually reads it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team&lt;/strong&gt; — The middle layer. Your team’s specializations. Maybe your frontend team has specific React patterns. Your data team has dbt conventions. Your platform team has infrastructure standards. These inherit from Department but add domain-specific knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project&lt;/strong&gt; — The bottom layer. Local context for a specific repo. The stuff that only matters here. Your project’s architecture decisions, custom tooling, specific quirks.&lt;/p&gt;

&lt;p&gt;In practice, this lives in a monorepo where each layer is a directory containing &lt;strong&gt;virtual subdirectory packages&lt;/strong&gt;. So you might have:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/department/standards-security
/department/standards-code-review
/team/frontend-react
/team/data-engineering
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each of those is a standalone APM package that can be versioned and depended on independently, but they all live in one repo where you can see the whole picture. You slap &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CODEOWNERS&lt;/code&gt; on the department folders so nobody changes the security standards without review, but teams get autonomy over their own specializations.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://microsoft.github.io/apm/guides/org-packages/&quot;&gt;Org-Wide Packages Pattern&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;solving-context-pollution-with-intelligent-compilation&quot;&gt;Solving Context Pollution with Intelligent Compilation&lt;/h2&gt;

&lt;p&gt;Here’s a problem I didn’t anticipate until I was neck-deep in it: &lt;strong&gt;context pollution&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You’ve got department-level instructions. Team-level instructions. Project-level instructions. Skills from three different packages. Prompts from two more. And now your AI assistant is trying to load all of that into a context window that is not infinite. Irrelevant instructions don’t just waste tokens and degrade performance. Tell an AI too many things and it starts forgetting the important ones.&lt;/p&gt;

&lt;p&gt;APM solves this with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm compile&lt;/code&gt;, which transforms all your scattered primitives into optimized, hierarchical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AGENTS.md&lt;/code&gt; files. It figures out which instructions belong at which level and how to structure them so the AI gets the most relevant context first.&lt;/p&gt;

&lt;p&gt;The conflict resolution model is opinionated: &lt;strong&gt;local project files always win&lt;/strong&gt;. If your project has an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AGENTS.md&lt;/code&gt; and an installed package also has one, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm install&lt;/code&gt; skips the existing file unless you explicitly &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--force&lt;/code&gt; it. During &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm compile&lt;/code&gt;, instructions get merged intelligently based on file patterns, but your local overrides stay on top. This is the right call. The project knows itself better than any upstream package does.&lt;/p&gt;

&lt;p&gt;I found this out the hard way, naturally. I had a package that defined broad coding standards and a project that had specific exceptions. Without the compile step, the AI was getting contradictory instructions and doing that thing where it apologizes and asks which rule you’d prefer it follow. With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm compile&lt;/code&gt;, the hierarchy is built in. The AI just does the right thing.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://microsoft.github.io/apm/guides/compilation/&quot;&gt;Compilation &amp;amp; Optimization Guide&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;automating-the-standards&quot;&gt;Automating the Standards&lt;/h2&gt;

&lt;p&gt;Once you have a pattern, you need a way to use without having to explain it every time. So I built &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;util-apm-builder&lt;/code&gt;: a meta-skill that helps scaffold new packages. Yes, I used an AI tool to build a tool that teaches AI tools how to use AI tools.&lt;/p&gt;

&lt;p&gt;Building this taught me something important about how AI skills actually can be structured. I do have skills that consist of a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKILL.md&lt;/code&gt;. They just describe what the skill doe and have a couple of examples. I have created more advanced skill with workflows and references too. I was used to Claude Code.&lt;/p&gt;

&lt;p&gt;But I was in GitHub Copilot world now and the structure it built for the first skill was really interesting:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Instructions&lt;/strong&gt; — Guardrails about the monorepo’s directory structure. “Department packages go here. Team packages go there. Don’t create folders outside this hierarchy.” Without these, the AI will hallucinate creative new locations for things.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Context&lt;/strong&gt; — The technical knowledge base. Manifest schemas. Valid field values. What &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm.yml&lt;/code&gt; actually accepts. This is the reference material the AI consults mid-task, and without it, you get manifests that look right but fail validation.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Prompts&lt;/strong&gt; — The executable task template. “Create a new package” with defined inputs for name, layer, type. This is what the developer actually triggers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKILL.md&lt;/code&gt; at the root makes it a hybrid package: part skill, part instruction set.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://microsoft.github.io/apm/guides/agent-workflows/&quot;&gt;Agent Workflows (Experimental)&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;local-iteration-and-the-playground-strategy&quot;&gt;Local Iteration and the Playground Strategy&lt;/h2&gt;

&lt;p&gt;Let me save you from a mistake I made so you can make different, more interesting mistakes.&lt;/p&gt;

&lt;p&gt;Do not install APM packages at the root of your monorepo during development. I did this. What happens is the AI discovers your package source files in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/department/my-package/&lt;/code&gt; AND the deployed copies in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm_modules/&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.claude/&lt;/code&gt;, and now it’s seeing the same instructions twice from two different locations. It doesn’t know which is authoritative. It gets confused. You get confused. Everyone’s confused. It’s a bad time.&lt;/p&gt;

&lt;p&gt;The fix is stupidly simple: create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/local&lt;/code&gt; folder as your playground. It’s a separate workspace where you install packages using relative path dependencies:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# /local/apm.yml&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;dependencies&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;../department/util-apm-builder&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;../team/frontend-react&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This gives you fast iteration without pushing to a remote registry, and it keeps the source packages and the deployed copies in separate directory trees so the AI doesn’t see double.&lt;/p&gt;

&lt;p&gt;One gotcha: VS Code only discovers skills at the root of an open workspace. So if you’re testing a new skill in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/local&lt;/code&gt;, you need to actually open that folder in VS Code, or set up a multi-root workspace that includes it. I spent ten minutes wondering why my skill wasn’t showing up before I figured this out.&lt;/p&gt;

&lt;p&gt;For git discipline: add &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm_modules/&lt;/code&gt; to your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.gitignore&lt;/code&gt; (it’s like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;node_modules/&lt;/code&gt;, derived, not source), but commit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm.lock.yaml&lt;/code&gt; and the deployed primitives in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.github/&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.claude/&lt;/code&gt;. The lock file ensures reproducibility. The deployed files ensure any developer who clones the repo gets the same AI context without needing to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apm install&lt;/code&gt; first.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://microsoft.github.io/apm/guides/dependencies/&quot;&gt;Dependencies &amp;amp; Lockfile Guide&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;from-supervisor-to-architect&quot;&gt;From Supervisor to Architect&lt;/h2&gt;

&lt;p&gt;There’s a maturity curve to working with AI coding assistants, and most of us are stuck somewhere in the middle of it.&lt;/p&gt;

&lt;p&gt;At the beginning, you’re a &lt;strong&gt;Supervisor&lt;/strong&gt;. You watch every line the AI writes. You correct it constantly. You paste errors back into the chat. You basically do pair programming where your partner has amnesia and you’re doing all the navigating.&lt;/p&gt;

&lt;p&gt;The next level is what I’ve been doing for a while: &lt;a href=&quot;https://www.stephanmiller.com/the-great-vibe-coding-experiment/&quot;&gt;running multiple AI tools on multiple projects simultaneously&lt;/a&gt;, trusting them with larger chunks of work, &lt;a href=&quot;https://www.stephanmiller.com/i-burned-out-on-vibe-coding-came-back-and-rewrote-everything/&quot;&gt;letting them plan and execute while I review the output&lt;/a&gt;. It’s better, but it’s still reactive. You’re managing agents, not engineering systems.&lt;/p&gt;

&lt;p&gt;What APM enables is the jump to &lt;strong&gt;Architect&lt;/strong&gt;. You define the standards, the guardrails, the knowledge hierarchy, and the execution patterns once. You version them. You distribute them. And then every AI assistant that touches any project in your ecosystem automatically knows how to behave, what standards to follow, and what context matters. You stop supervising individual interactions and start engineering the environment those interactions happen in.&lt;/p&gt;

&lt;p&gt;The best part is the escape hatch. “Back in the day” in AI terms is last month. Who knows what will change. APM’s output is native configuration files: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AGENTS.md&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.cursor/rules&lt;/code&gt;, skill definitions. If APM disappears tomorrow, or you decide it’s not for you, those files keep working. You haven’t locked yourself into anything except having better-organized AI context, which is not exactly a downside.&lt;/p&gt;
</description>
        <pubDate>Mon, 13 Apr 2026 07:00:00 -0500</pubDate>
        <link>https://www.stephanmiller.com/architecting-the-future-of-ai-native-engineering/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/architecting-the-future-of-ai-native-engineering/</guid>
        
        
        <category>agentic-development</category>
        
        <category>ai-agents</category>
        
      </item>
    
      <item>
        <title>I Burned Out on Vibe Coding, Came Back, and Rewrote Everything</title>
        <description>&lt;p&gt;I hit a wall with vibe coding. Not a dramatic crash. More like the slow realization that I’d been sprinting for months and couldn’t remember why. I had &lt;a href=&quot;https://www.stephanmiller.com/the-great-vibe-coding-experiment/&quot;&gt;15 projects in various states of “maybe done,”&lt;/a&gt; a GitHub commit chart that looked like a heart monitor, and a growing suspicion that I was building things just to build things.&lt;/p&gt;

&lt;p&gt;Fortunately, freelance writing work picked up right around the same time. Enough to actually pay attention to it. So I stepped away from the side projects, wrote about other people’s technology for a change, and let my own code sit untouched for a few months.&lt;/p&gt;

&lt;p&gt;When I came back, I had no patience for bullshit. And I looked at my projects differently.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/93myIeRtsN0&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#your-vibe-coded-apps-are-prototypes-and-thats-fine&quot; id=&quot;markdown-toc-your-vibe-coded-apps-are-prototypes-and-thats-fine&quot;&gt;Your Vibe-Coded Apps Are Prototypes (And That’s Fine)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#making-adding-features-the-feature&quot; id=&quot;markdown-toc-making-adding-features-the-feature&quot;&gt;Making “Adding Features” the Feature&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#building-bottom-up-with-verdent-and-claude-code&quot; id=&quot;markdown-toc-building-bottom-up-with-verdent-and-claude-code&quot;&gt;Building Bottom-Up with Verdent and Claude Code&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-60-missing-apis&quot; id=&quot;markdown-toc-the-60-missing-apis&quot;&gt;The 60 Missing APIs&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#making-plans-that-any-ai-agent-can-execute&quot; id=&quot;markdown-toc-making-plans-that-any-ai-agent-can-execute&quot;&gt;Making Plans That Any AI Agent Can Execute&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-same-pattern-different-project&quot; id=&quot;markdown-toc-the-same-pattern-different-project&quot;&gt;The Same Pattern, Different Project&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#what-changed&quot; id=&quot;markdown-toc-what-changed&quot;&gt;What Changed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;your-vibe-coded-apps-are-prototypes-and-thats-fine&quot;&gt;Your Vibe-Coded Apps Are Prototypes (And That’s Fine)&lt;/h2&gt;

&lt;p&gt;Here’s the thing I couldn’t see while I was in the thick of it: almost everything I’d built with AI coding tools was a prototype. Not in the dismissive sense. These apps worked. &lt;a href=&quot;https://www.stephanmiller.com/electron-project-from-scratch-with-claude-code/&quot;&gt;EmberText&lt;/a&gt; was a functional Electron writing app. Niche Site Factory could generate and manage content sites. They ran. They did things.&lt;/p&gt;

&lt;p&gt;But they were all built top-down. I’d tell the AI “build me an app that does X” and it would scaffold the whole thing, features and all, in one giant session. The problem is that when you build top-down with AI, you end up with something that works but is almost impossible to extend. Every new feature is a negotiation with the existing architecture. You’re not adding to the app. You’re fighting it.&lt;/p&gt;

&lt;p&gt;EmberText was the clearest example. I built it with Claude Code over about 16 hours and $80 in API costs. It had AI integration, text generation, character relationship graphs, plot scaffolding. Impressive on paper. But by the time I realized it should have had a plugin architecture, I was already deep enough that refactoring meant essentially starting over.&lt;/p&gt;

&lt;p&gt;So that’s what I did.&lt;/p&gt;

&lt;h2 id=&quot;making-adding-features-the-feature&quot;&gt;Making “Adding Features” the Feature&lt;/h2&gt;

&lt;p&gt;The insight that changed everything was stupid simple: instead of building an app with features, build an app where adding features &lt;em&gt;is&lt;/em&gt; the feature.&lt;/p&gt;

&lt;p&gt;I’d been using Obsidian for years and it’s in my top 5 favorite software. It’s incredible for notes, planning, and organization. You can even make it distraction-free for writing. It’s just not the default, and “not the default” matters more than you’d think when you’re trying to get into a flow state. I tried to hack around this with my Daily Prompts plugin that launched an alert and opened a daily note in Zen mode. It worked, kind of, but I was still fighting the tool.&lt;/p&gt;

&lt;p&gt;VS Code is for code. Obsidian is for notes. What’s for writing?&lt;/p&gt;

&lt;p&gt;That question led to Veneer, a complete rewrite of EmberText from scratch. Same idea, a distraction-free writing environment, but built from the ground up as a plugin-first architecture. The “Zen-First Shell” concept: when you open it, you see nothing but a clean sheet and your text. Sidebars, ribbons, status bars exist as ghost elements, hidden by default, appearing only when you hover near the edges or hit a hotkey. Everything that isn’t the writing surface has to earn its right to be on screen.&lt;/p&gt;

&lt;p&gt;And critically, every feature is a plugin. The file explorer? Plugin. The markdown editor? Plugin. The command palette? Plugin. Even core functionality ships as plugins that can be swapped, extended, or replaced. This isn’t just for a future community. It makes the whole thing dramatically easier to build with AI, because each plugin is a self-contained unit with clear boundaries. You can hand an AI agent a plugin spec and let it work without worrying about it breaking everything else.&lt;/p&gt;

&lt;h2 id=&quot;building-bottom-up-with-verdent-and-claude-code&quot;&gt;Building Bottom-Up with Verdent and Claude Code&lt;/h2&gt;

&lt;p&gt;I used &lt;a href=&quot;https://www.verdent.ai/&quot;&gt;Verdent&lt;/a&gt; to build the base application. If you read &lt;a href=&quot;https://www.stephanmiller.com/verdent-ai-when-your-ai-coding-assistant-finishes-before-you-can-get-coffee/&quot;&gt;my post about Verdent&lt;/a&gt;, you know this thing is fast. Too fast, honestly. It finished most of the base app, including a file browser sidebar plugin, a markdown editor plugin, and a command palette, in about 220 tokens, roughly $20 worth of credits. There were bugs left when I ran out of tokens, but the foundation was solid.&lt;/p&gt;

&lt;p&gt;But here’s where the process got interesting. Instead of just continuing to add features on top, I switched to Claude Code and did something I hadn’t done before: I asked it to &lt;em&gt;audit&lt;/em&gt; the codebase.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;use your skills and check the repo for best practices
- UI
- Is it themable like Obsidian or VS Code
- Plugin Architecture (and compare to VS Code and Obsidian)
- TypeScript
- Electron
- Structure, Naming Conventions
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I’d forgotten how many skills and plugins I had installed in Claude Code. When I ran this, it deployed four specialized agents in parallel:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Explore subagent&lt;/strong&gt; analyzed the overall project structure, UI patterns, theming, and naming conventions&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Architecture Strategist&lt;/strong&gt; evaluated system design decisions and compared the plugin architecture against VS Code and Obsidian&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Kieran TypeScript Reviewer&lt;/strong&gt; checked strict mode compliance, type safety, interface definitions, and generic patterns&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Best Practices Researcher&lt;/strong&gt; gathered industry standards and found examples from successful projects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not how I was working six months ago. Six months ago, I would have just told the AI to add the next feature and hoped for the best.&lt;/p&gt;

&lt;h2 id=&quot;the-60-missing-apis&quot;&gt;The 60 Missing APIs&lt;/h2&gt;

&lt;p&gt;The audit turned up a lot. Claude gave the codebase an A- (92/100) overall, which sounds great until you read the details. The critical finding was the plugin API gaps. Obsidian provides 60+ plugin APIs. Veneer was missing most of them.&lt;/p&gt;

&lt;p&gt;No modals. No notification system. No context menus anywhere. No way for plugins to subscribe to file or workspace events. No way to extend the CodeMirror editor. The native OS menu had “Open Folder” under “Veneer” instead of “File,” which is the kind of thing that makes you realize the AI built the structure but didn’t think about the conventions.&lt;/p&gt;

&lt;p&gt;I had Claude store all the findings in the project’s docs folder:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;BEST_PRACTICES_REVIEW.md&lt;/strong&gt;: Everything organized by priority with an implementation roadmap&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PLUGIN_API_GAPS.md&lt;/strong&gt;: A detailed comparison against Obsidian and VS Code showing exactly what was missing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;making-plans-that-any-ai-agent-can-execute&quot;&gt;Making Plans That Any AI Agent Can Execute&lt;/h2&gt;

&lt;p&gt;This is the part that parallels &lt;a href=&quot;https://mitchellh.com/writing/my-ai-adoption-journey&quot;&gt;Mitchell Hashimoto&apos;s AI adoption journey&lt;/a&gt;. He talks about “harness engineering,” the idea that every time an agent makes a mistake, you engineer a solution so it never makes that mistake again. Better implicit prompting. Actual programmed tools. The goal is building up an ecosystem where agents get better over time.&lt;/p&gt;

&lt;p&gt;I’m doing something similar, but at the project planning level. Instead of just fixing bugs as they come, I’m creating structured documentation that any AI tool can pick up and execute. My next prompt to Claude was:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Take docs/BEST_PRACTICES_REVIEW.md and docs/PLUGIN_API_GAPS.md and create a
markdown list of TODOs in the docs folder. These should be grouped into tasks
and subtasks. If it is possible to work on some tasks concurrently this should
be mentioned. This file should be able to be used by an AI agent to finish
these tasks. Add enough details to each task to speed up development time.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now I have the work planned in 3 phases across 3 TODO files, plus a final phase listing every Obsidian plugin API that Veneer doesn’t have yet for future development. These are all in the docs folder of the project, version controlled, and written so that any agent, Claude Code, Jules, a VS Code extension with Qwen, whatever, can pick them up and start working.&lt;/p&gt;

&lt;p&gt;This is the difference between vibe coding and what I’m doing now. I’m still using AI to do the heavy lifting. But I’m not just throwing prompts at the wall. I’m using one AI tool to build, another to audit, and then creating structured plans that decouple the &lt;em&gt;what needs to happen&lt;/em&gt; from the &lt;em&gt;which tool does it&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;the-same-pattern-different-project&quot;&gt;The Same Pattern, Different Project&lt;/h2&gt;

&lt;p&gt;This isn’t just how I rebuilt Veneer. I’m doing the same thing with Niche Site Factory. Instead of telling an AI to “build me a niche site generator” (which is roughly what I did the first time), I started over by building the data model first.&lt;/p&gt;

&lt;p&gt;I took a real project, a sci-fi encyclopedia wiki, and used it to design the content structures. A knowledge graph in PostgreSQL with pgvector for embeddings. 2,622 books ingested into the entities table. Flexible JSONB storage that can handle books, concepts, authors, movies, whatever. The data model came first, the application came second.&lt;/p&gt;

&lt;p&gt;It’s the same bottom-up principle. Don’t build the house and then figure out the foundation. Build the foundation, verify it’s solid, then build up from there.&lt;/p&gt;

&lt;h2 id=&quot;what-changed&quot;&gt;What Changed&lt;/h2&gt;

&lt;p&gt;I think the burnout was actually useful. Stepping away let me see the pattern I was stuck in: build fast, hit a wall, start something new. That’s fine when you’re learning the tools. It’s how I figured out what Claude Code, &lt;a href=&quot;https://www.stephanmiller.com/how-i-built-two-obsidian-plugins-while-kiro-ai-did-most-of-the-work/&quot;&gt;Kiro&lt;/a&gt;, Verdent, and &lt;a href=&quot;https://www.stephanmiller.com/using-jules-to-update-my-obsidian-plugin/&quot;&gt;Jules&lt;/a&gt; are each good at. But at some point, you have to stop prototyping and start building.&lt;/p&gt;

&lt;p&gt;Here’s what’s different now:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Bottom-up, not top-down.&lt;/strong&gt; Start with the architecture and data model, not the features. Let the AI build on a solid foundation instead of improvising one.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Audit before extending.&lt;/strong&gt; Use AI review tools to find the gaps before you pile on more code. It’s cheaper to fix the structure now than refactor later.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Plans as portable artifacts.&lt;/strong&gt; Write TODO files detailed enough that any AI agent can execute them. Don’t marry yourself to one tool.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Plugins as a development strategy.&lt;/strong&gt; A plugin architecture isn’t just for the community. It makes AI-assisted development dramatically easier because each unit is self-contained.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Past work is research.&lt;/strong&gt; EmberText wasn’t a failure. It was a $80 prototype that taught me exactly what Veneer needed to be.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m still &lt;a href=&quot;https://www.stephanmiller.com/category/vibe-coding/&quot;&gt;vibe coding&lt;/a&gt;. I’m just vibing with more structure now and calling it &lt;a href=&quot;https://www.stephanmiller.com/category/agentic-development/&quot;&gt;agentic development&lt;/a&gt;. And honestly, after a few months of writing for clients and not touching my own projects, coming back to this with fresh eyes and no patience might be the best thing that happened to any of them.&lt;/p&gt;
</description>
        <pubDate>Sun, 08 Feb 2026 01:00:00 -0600</pubDate>
        <link>https://www.stephanmiller.com/i-burned-out-on-vibe-coding-came-back-and-rewrote-everything/</link>
        <guid isPermaLink="true">https://www.stephanmiller.com/i-burned-out-on-vibe-coding-came-back-and-rewrote-everything/</guid>
        
        
        <category>agentic-development</category>
        
        <category>vibe-coding</category>
        
      </item>
    
  </channel>
</rss>
