Stephan Miller

The Living Plan Got Fat: Compacting a Doc That Won't Stop Growing

Wed, 01 Jul 2026 07:00:00 -0500

In a post a few weeks ago, I took a victory lap. I’d finally found a way to work with AI coding agents that didn’t collapse into chaos or drown me in specs. It was a single living PLAN.md that the agent re-reads every session, where every decision gets funneled in and logged. I ended that post on a line I was proud of: the doc is the deliverable, the code is the byproduct, and over 30 days my PLAN.md got touched 16 times, more than any actual code file in the repo.

That line was true. It was also the setup for the next problem, which I did see coming. I have just been only talking to Opus lately and having Sonnet minions do all the grunt work, which was really working out. I could work for a couple of hours or more and context barely got over 10% and I am only on the Pro plan. I just thought I had a little more runway.

The exact property that makes a living plan work is what eventually turns it into a chore. A doc you re-read every single session is great at 3,000 words. At 28,000 words it’s a lot to load before any thinking starts. This post is about the maintenance layer, and the skill I built so I’d never have to think about it manually again.

If you didn’t read the first post, the one-line version: instead of vibe coding or a pile of spec files, I keep one PLAN.md that records decisions, open questions, and a session log, and the agent treats it as the source of truth.

The Good Problem
Progressive Disclosure, Which I Was Already Doing Everywhere Else
What “Cooled” Actually Means
The Payoff: 28,357 Words Down to 8,332
It’s a Process
Why It Became a Skill: The Other Project Running This
The Skill Is a Router, Not a Script
The Honest Scope Note
What’s Next

The Good Problem

The living plan worked so well that I kept feeding it. Every architectural call went in as a numbered decision, sessions were appended a log entries, and half-formed “we should maybe…” landed in a parking lot instead of me forgetting it.

Then one morning I ran wc -w on the PLAN.md for content-tools-v2, my content pipeline project, and it told me 28,357 words.

Twenty-eight thousand words. That’s a novella. Which means every session I was spending context budget and my own attention re-skimming decisions D1 through D18, none of which had changed in over a month. D5 got decided in week two and never reopened. Why was I carrying its full body into every session four months later?

The decisions I’m actively wrestling with this week are maybe 20% of the document. The other 80% is settled history. Important history but it doesn’t need to be in my face every session. It needs to be findable, not present.

Progressive Disclosure, Which I Was Already Doing Everywhere Else

It’s how I work with CLAUDE.md in my projects. It doesn’t inline the architecture and the quickstart and the plan. It points at them: “see QUICKSTART.md, see PLAN.md.” The conductor file stays small; the heavy docs are one hop away.

I was using progressive disclosure everywhere except the one document that needed it most. Keep the hot layer (live decisions, open questions, recent sessions) right there in PLAN.md. Move the cold layer out to a docs/ tree, and leave a one-line pointer behind.

A collapsed decision still shows its heading. You still see that it exists and it links straight to the full body. Nothing is hidden.

Quick naming note, because I’m about to use one word a lot. I call this compacting the plan. When I first built the skill I named the operation “rebalancing”, and that name is now baked in. It’s the trigger word, it’s the name of the log file. So “rebalance” is going to keep showing up in this post whether I like it or not. But it’s the wrong word. I’m not balancing two things against each other toward some equilibrium. I’m compacting the hot doc and paging the cold half out to linked files on disk. So: compacting. (The skill keeps its dumb name until post 3, where it finally earns a better one.)

What “Cooled” Actually Means

This is the part that took real thought, because “just move the old stuff” is not a rule a tool can follow. Not everything cools the same way, and the heuristic has to be specific enough that I can hand it to an agent and trust the result. Here’s what I landed on:

Session logs cool by recency. Keep the last few inline (I keep three), archive everything older into docs/sessions/. This is almost always the heaviest cold mass in the whole document and it’s the safest thing to move, because a six-week-old session log is pure archive. Nobody’s making decisions off it.

Decisions stay whole while they’re hot, and collapse to a heading-plus-link when they cool. A decision is “hot” if it’s recent, still being referenced, or actively shaping current work. It cools when it’s settled, built, and nothing live points at it anymore. The exception: anything marked 🔒 foundational stays whole no matter how old it is, because the whole architecture leans on it. In content-tools that’s D23, the composable-pipeline decision the entire engine is built on. That one’s never getting collapsed.

Questions and parking-lot items are born as pointers. This was the cleanest insight. A parking-lot item doesn’t need to be classified later as hot or cold. The heading is the item. Only the question itself lives in PLAN.md. The discussion, if there is any, lives in a file in docs/. There’s no re-classification step because it started in the right shape.

The before-and-after looks like this. A hot decision sits in PLAN.md with its full body:

### D19 — Slot-3 stage reshape: deduplicator-as-gate, moved to slot 2,
contingent on empirical templating

[...a few hundred words of reasoning, tradeoffs, and what it depends on...]

A cooled one collapses to a single line that still tells you what it is and where to read it:

### D5 — Per-workspace knowledge base: curated wiki built from a raw dump  → [full text](docs/decisions/D05.md)

You lose nothing. You can still see D5 exists and you’re one click from the full doc. You just stopped paying for it in every session.

The Payoff: 28,357 Words Down to 8,332

I ran the first real compaction on content-tools-v2 on June 19th. The PLAN.md went from 28,357 words to 8,332. Roughly a 70% cut, and not a single byte of history was deleted. It was relocated.

Here’s the docs/ tree it produced:

docs/
├── decisions/
│   ├── D01.md
│   ├── D02.md
│   ├── ...
│   └── D18.md
├── sessions/
│   └── sessions-01-18.md
├── questions/
├── parking-lot/
├── futures.md
└── reference-prior-artifacts.md

Eighteen settled decisions filed one-per-file. Sessions 1 through 18 collapsed into a single archive. Resolved questions, speculative futures, and the pile of “prior planning artifacts” reference material all moved out.

What’s left in PLAN.md is the hot layer only: the live decisions D19 through D24 with their full bodies, the open questions, and the active parking lot. Everything that’s settled is exactly one link away.

If the doc is the deliverable (which was the whole thesis of post 1), then this isn’t busywork. This is just refactoring the deliverable. You don’t delete the git history when you clean up a codebase. You file it.

It’s a Process

Now, I could have stopped there. One afternoon and done. But I didn’t want to redo by hand every month because the thing keeps growing. The plan is a living document. It got fat once; it’ll get fat again. A one-off fix for a recurring problem is just a chore you’ve scheduled for future-you.

So I built it as a repeatable, checked-off process with two triggers.

The first trigger is a phrase. I say “rebalance” (the skill’s word, not mine) and it runs. Simple.

The second is a size nudge. When the skill gets invoked, it checks the word count, and if PLAN.md is over a threshold (default 15,000 words) it surfaces it: “PLAN.md is 27.6k words, ~12k over threshold, want to rebalance?” The key detail here, and the one I had to learn the hard way, is that the threshold is measured by wc -w PLAN.md, not by live context percentage. The agent cannot reliably read its own context usage.

And then there’s a ledger. Every run writes to docs/rebalance-log.md (the file’s named for the old word too), recording each hot/cold call and the reason behind it. Here’s the actual entry from that first run:

### 2026-06-19 — rebalance (PLAN.md 28357 → 8332 words)
First run; link graph not yet built, so all calls were manual read-through.
- sessions 1–18 | archived → docs/sessions/sessions-01-18.md | recency, kept last 3 (19–21)
- D1–D18 (bodies) | cooled → docs/decisions/D01–D18.md | settled/built; nothing active references them
- D19–D22 | kept | recent, conservative call (keep-recent over cooling all of D1–D22)
- D23 | kept | 🔒 foundational (novel pipeline / D23 step-3 depends on it)
- D24 | kept | active (linking work, sessions 18–21)
- Q6, Q7 | archived → docs/questions/ | resolved (Q7→D20, Q6 closed)
- Q8 | kept | open (D24 is its first capability)

Why bother logging the reasons? Because the plan is that once a reason recurs often enough, it gets promoted to an automatic rule. The first few runs are pure per-run judgment. But “sessions older than the last three always archive” is a pattern that shows up every single time, so eventually that stops being a judgment call and becomes a rule the skill just applies. It’s the skill-hardener pattern, except pointed at my own process: per-run judgment now, automation later, and the ledger is what bridges the two.

Why It Became a Skill: The Other Project Running This

The thing that pushed this from “a tidy script for one project” to “a reusable skill” was realizing I had the exact same disease on another repo. And that it was a slightly different flavor of it.

My other live project is a knowledge-graph site I mentioned in the update section of post 1. It had a living PLAN.md too, around 14,700 words, but it had drifted off the rails of the workflow itself: a plan, yes, but no implementer agent and no docs/ tree to cool anything into. content-tools just needed a diet. This project needed the diet and a couple of missing organs put back first.

So the deliverable couldn’t be a one-off manual reorg of one repo. It had to be a reusable convention I could point at any project, including one that wasn’t even fully set up yet. That’s the plan-rebalance skill. It lives in my skillshare directory and syncs out to all my tools, so it’s available wherever I’m working. I ran it across both projects the same day, June 19th. On content-tools it did a straight rebalance. On the other project it did an adopt-plus-rebalance in one pass: created the implementer agent, scaffolded the docs/ tree, reconciled the conductor file, then trimmed the plan from 14,700 down to about 11,000 words.

But here’s the catch that made portability non-optional: my two projects don’t even agree on what their own files are called.

	content-tools-v2	nsf-scifi-wiki
Task file	`TASK.md`	`TASKS.md` (plural)
Conductor file	`CLAUDE.md`	`AGENTS.md`
Wrinkle	none	`CLAUDE.md` is a symlink to `AGENTS.md`

If the skill had assumed TASK.md and CLAUDE.md, it would have written through the symlink instead of the real file and missed the task handoff entirely. So the skill discovers names, it never assumes them. It globs for the planning doc, tolerates singular-or-plural task files, and resolves the conductor symlink with readlink so it edits AGENTS.md directly instead of writing through the link. It even records what it found in that project’s own rebalance log, so the next run doesn’t have to re-derive any of it:

## Config
- Planning doc:   PLAN.md
- Task handoff:   TASKS.md (plural — project convention)
- Conductor:      AGENTS.md (CLAUDE.md is a symlink → AGENTS.md; edit the real target)
- Implementer:    implementer (.claude/agents/implementer.md, model: sonnet)
- Word threshold: 15000

The Skill Is a Router, Not a Script

The other design constraint came straight out of post 1’s philosophy: the planning agent has to stay light. Its job is to plan and orchestrate, nothing else. If I stuff a thousand words of compaction logic into PLAN.md or CLAUDE.md, I’ve just re-bloated the exact files I’m trying to keep lean. That’d be self-defeating.

So the skill is a router. The heavy process logic is lazy-loaded from the skill’s own workflows/ and references/ files. The project only ever gets a one-line pointer that says “invoke this skill,” never a copy of the process. The skill’s own description sums up its job: it “owns the planning-doc workflow.” The project doesn’t have to know how compaction works. It just has to know who to call.

When you invoke it, the first thing it does is detect what state the project is in and route accordingly:

State	What it means	Route
Greenfield	no planning doc at all	scaffold the whole workflow: PLAN/TASK/implementer agent/docs tree
Partial / adopt	a plan exists but pieces are missing	backfill only what’s missing, idempotently
Mature	full system, heavy doc	rebalance (the compaction pass)

content-tools was the Mature case. It had the whole post-1 system already and just needed the diet. The other project was the Partial case: a plan, but no implementer agent and no docs tree, so it needed a backfill and a rebalance, which is exactly what it got, in a single run. A project can need more than one of these, and the skill reports what it found and lets me pick the actions; it doesn’t silently chain them.

And every mutating action is propose-then-confirm. It shows me a manifest of what’s moving where before it touches anything. It never clobbers an existing file. I diffed the moved decision files against HEAD to confirm nothing got “helpfully” reworded on the way out. When a tool is rearranging the document that is my deliverable, I want a receipt.

The Honest Scope Note

Every post in this series gets one of these, so here’s this one’s: none of this matters until the living plan has already won.

This is a good-problem-to-have tax. You only hit the 28,000-word wall because the plan worked well enough that you kept feeding it for weeks. If you build the compactor before you have a plan worth compacting, congratulations, you’ve just reinvented spec-kit ceremony in a different hat, which is the exact thing the entire first post was an argument against. Don’t do that. Get the plan working first. Let it get fat. Then put it on a diet.

There’s no clever algorithm here. It’s “move the old stuff to a folder and leave a link.” But the alternative is the plan slowly turning into the document you dread opening, and the day you start avoiding your own source of truth is the day the whole method quietly dies. The boring maintenance layer is what keeps the interesting part alive.

What’s Next

So the plan stays lean now. The hot layer is hot, the cold layer is filed, and the compactor keeps it that way on a trigger instead of on my willpower.

Except (and this is where post 3 picks up) keeping the plan exposed a completely different bottleneck on the build side. My setup hands one task at a time to a background Sonnet agent and and then I try to plan in the gap. But the gap was not big enough. When your plan is finally clean enough to generate work faster than your builder consumes it, the new question becomes: how do you keep the builder always busy?

That’s the next experiment: the batch throughput problem, and this same skill growing to own that part of the workflow too. (And, if I’m lucky, finally getting a name I can actually remember and makes sense.)

Model Buzz Roundup — Week of June 24, 2026

Tue, 30 Jun 2026 08:00:00 -0500

Last week I called the West’s next two flagship models vaporware. This week one of them shipped, straight into government lockup with the others.

Last week the story was a single model getting unplugged: Claude Fable 5, the number one model on every leaderboard, switched off June 12 by a US export-control directive. I figured that was a one-off horror story, and that GPT-5.6 and Gemini 3.5 Pro, the two heirs everyone was waiting on, would show up and give us something we could actually use.

Reader, they did not. This week GPT-5.6 launched and it’s about as usable as Fable 5, for the exact same reason: Washington. Gemini 3.5 Pro slipped to July. And the one model that quietly went from “interesting” to “essential”? It’s the open-weights Chinese one the whole export-control apparatus is supposedly designed to stop. Except it can’t, because the weights are already a download and it just beat Claude on the cybersecurity benchmarks the bans were about. You know how that goes.

The Gate Spread to the Whole Frontier
Export Controls Failed Their First Real Test This Week
The Model Nobody Can Switch Off (Still GLM-5.2)
Cheapskate Picks: Best You Can Actually Run
Horror Stories from the Wild
Where This Leaves You

The Gate Spread to the Whole Frontier

Here’s the thing that turned this from “a weird week for Anthropic” into “a structural shift”: the export-control net is no longer catching one model. It’s catching every flagship a US lab can ship.

Count them. Start with Anthropic’s Fable 5, still dark, going on three weeks now. On June 26, Commerce Secretary Howard Lutnick sent a follow-up letter that partially lifts the block on Mythos 5 (the heavier sibling), but only for a defined list of US entities, their foreign-national employees, Anthropic’s own foreign staff, and government partners. Fable 5, the model normal humans and API developers actually call, stays banned. The partial thaw is for the spy-cleared crowd, not for you.

Then there’s the new one, OpenAI’s GPT-5.6. It got previewed June 26 as a three-model lineup: Sol (the flagship), Terra (balanced), and Luna (fast and cheap). New “max” reasoning effort, a new “ultra” mode that spins up subagents, state-of-the-art on Terminal-Bench 2.1, big gains on biology evals. Sounds great. You can’t use it. OpenAI limited the launch to roughly 20 government-vetted partners at the request of the US government. This was the first public run of the AI-review process set up under the recent frontier-AI executive order, where a lab hands a “covered” model to the feds for up to 30 days before it can go to trusted partners. OpenAI itself warned that this kind of gating “should not become the long-term default” because it delays everyone downstream. General availability is “in the coming weeks.” Translation: maybe July.

And Google’s Gemini 3.5 Pro? Still not out. Announced at I/O on May 19 with a “give us until next month,” and next month is basically over. As of June 29 reporting it’s stuck in limited Vertex preview and the public launch officially slid to July. Two-million-token context, Deep Think reasoning, all very nice, all behind an enterprise gate.

So tally it up: of the three newest, best models from the three biggest US labs, one is suspended, one is government-rationed to 20 partners, and one is a preview that keeps sliding. Every leaderboard “winner” this week comes with the same asterisk it had last week, of the ones you can actually call, except now the asterisk applies to the challengers too.

Export Controls Failed Their First Real Test This Week

Now for the part that’s genuinely funny, in the bleak way.

The whole justification for switching off Fable and Mythos was cyber capability, the fear that a jailbreak could turn them into offensive cybersecurity tools. Fine. Defensible premise. Here’s the problem: the day after Anthropic pulled its models, Z.ai shipped GLM-5.2 with open MIT-licensed weights, and this week a security firm sat down and measured it.

Semgrep’s writeup is titled, and I am not making this up, “We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks.” TechTimes ran with “AI Export Controls Fail Their First Real Test.” The exact class of capability the directive was meant to contain is now sitting on Hugging Face under an MIT license, FP8 and GGUF quants included, free to download and self-host on hardware nobody can subpoena.

You cannot export-control a torrent. The directive successfully inconvenienced every paying, legitimate user of two American models, while the capability it was worried about walked out the front door in an open-weights release from a lab outside US jurisdiction. That’s not a security win. That’s security theater with a body count of exactly zero bad actors and a whole lot of annoyed developers.

I’m not saying the underlying worry is fake. Frontier cyber capability is a real thing to think hard about. I’m saying that “ban the American version” does precisely nothing when an equivalent open-weights model ships from Shenzhen a day later, and pretending otherwise is how you get policy that hurts the people following the rules and helps no one else.

The Model Nobody Can Switch Off (Still GLM-5.2)

Same hero as last week, and the case for it got stronger.

GLM-5.2 from Z.ai is the highest-ranked open-weights model on Artificial Analysis’s Intelligence Index. It sits at number six overall (score 51) and parks on the value frontier. VentureBeat clocked it beating GPT-5.5 on SWE-bench Pro (62.1% vs 58.6%) at roughly one-sixth the cost. MIT weights, a usable 1M-token context, and output pricing in the low single digits per million tokens depending on which provider you route through. The weights dropped for real the week of June 22, so the “trust but verify” caveat I put on it last week is now “verify it yourself, it’s right there.”

And it’s not a fluke, it’s the whole shape of the market. On OpenRouter, DeepSeek is the single largest model author by volume at around 17.6% of all platform tokens, and Chinese-origin models are somewhere in the 46%+ range of everything flowing through the platform. Tencent’s Hy3 is still the tool-call king. DeepSeek V4 Flash ($0.28 per million out) is the cheap daily driver an enormous chunk of production traffic quietly runs on. The usage has been voting open-and-cheap for months; this week the news finally made the argument out loud.

Connect last week to this week and the throughline isn’t price, it’s control. Hosted frontier models can be revoked by directive, rationed to 20 partners, or stuck in preview indefinitely. The model in your downloads folder can’t be any of those things. The cheapskate argument and the geopolitics argument keep landing on the same advice: own your weights.

Cheapskate Picks: Best You Can Actually Run

Same method as always. Take the Arena leader in each category, draw a 50-rating-point band below it, find the cheapest model in that band. Arena’s top is so compressed that paying 5–16x more usually buys you a sub-3% rating bump. The wrinkle, still: the leader in five of six categories is Fable 5, which is suspended, so each row also names the cheapest thing you can actually call, anchored to that unusable leader’s rating. Output dollars per million, because output dominates real workloads. Arena snapshot is June 25.

Category	Leader (status)	$ out	Cheapskate pick	$ out	Δ rating	Cheaper by	AA value frontier
Overall	Fable 5 (SUSPENDED)	$50	GLM-5.1	~$3	n/a	~16x	yes (GLM-5.2 #6)
Coding	Fable 5 (SUSPENDED)	$50	GLM-5.1 / GLM-5.2	~$3–4	−39	~13–16x	yes
Math	Opus 4.6-thinking (Fable tie SUSPENDED)	$25	Gemini 3.5 Flash	$9	−0 (tie)	~3x	yes
Creative Writing	Fable 5 (SUSPENDED)	$50	Gemini 3.5 Flash	$9	−34	~6x	nearby
Instruction Following	Fable 5 (SUSPENDED)	$50	Gemini 3.1 Pro	$12	−37	~4x	nearby
Hard Prompts	Fable 5 (SUSPENDED) / Opus 4.6-thinking	$25	Gemini 3.1 Pro	$12	−25	~2x	nearby

What the table is actually saying:

Coding is still the slam dunk. GLM-5.1 (Arena 1525) and GLM-5.2 both sit in the band at a few bucks output, both beat or match Claude Sonnet 4.6 (1527, $15) on price, and GLM-5.2 throws in the 1M context, the open weights, and that SWE-bench Pro number. Cheapskate methodology and AA’s value frontier agree. Highest-confidence pick of the week, again.

Math got better for the cheapskate. This is the fun one. With Fable 5 gone, Gemini 3.5 Flash now literally ties for number one on the Math board at 1517, dead level with Opus 4.6-thinking and the suspended Fable 5. So the cheap model isn’t “the best you can settle for” in math anymore, it’s co-champion, at $9 against Opus’s $25. Want the absolute floor? Qwen3.7 Max (1492, $3.75) is cheaper still and only 25 points back.

Creative writing stays a Gemini 3.5 Flash story at $9. In band, no regrets if words are the product. GLM is cheaper but stylistically thin for prose, so I’m not going to pretend it’s the move here.

Instruction Following has no genuine steal. Nothing under $10 lives in that band. Gemini 3.1 Pro at $12 is the honest value floor and I’m flagging it as “you’re paying for quality” rather than inventing a bargain that isn’t there.

Hard Prompts is the category where the usable leader is right there. Only Fable and Mythos got suspended, not the Opus line. Opus 4.6-thinking (1532, $25) leads among models you can actually call, and Gemini 3.1 Pro (1507, $12) gets you within range for half the money.

Boring-but-correct summary, unchanged from last week because the market didn’t move: if you’re not doing something that truly needs the frontier, GLM-5.1/5.2 and Gemini 3.5 Flash cover most of your week for single digits per million tokens, and at least one of them runs on your own iron.

Horror Stories from the Wild

First, the launch you’re not invited to. GPT-5.6 “shipped” June 26, and for ~99.99% of developers that meant reading a blog post about a model they can’t touch. No API, no app, no AI Studio. Twenty vetted partners and a “coming weeks” promise. OpenAI publicly admitted the gating delays “users, developers, enterprises, cyber defenders, and global partners.” A launch you can’t use isn’t a launch, it’s a press release with benchmarks attached, and you can’t verify the benchmarks either. (VentureBeat, The Hacker News)

Then there’s the flagship that’s still missing. Three weeks on, if you shipped anything on Fable 5 during its brief public window in early June, you’re still holding a dead dependency, and the June 26 partial reprieve doesn’t include you unless you’re an Annex A entity with a security clearance. The lesson from last week didn’t expire. It compounded. “Hosted by a responsible lab” is not “under my control,” and the failure mode is sometimes a federal agency that doesn’t care about your sprint. (Anthropic’s statement, Forbes)

Where This Leaves You

I came into June expecting to spend these posts arguing about benchmark deltas. Instead I’ve spent three of them watching the US government become the most important variable in which model you can run. That’s the genre now. The frontier is real and moving fast, and it is also increasingly a thing that can be switched off, rationed, or delayed by people who have never seen your codebase.

No inspiration porn, just the unglamorous read: if your work genuinely needs the absolute top of the stack, pay for it. But architect like it can vanish, because for three labs in a row this month it either did or never arrived. For everything else, which is most things, the open-weights stuff isn’t settling anymore. GLM-5.2 is the sixth-smartest model on Earth, it beat the banned American model on the cyber benchmarks that got it banned, and you can download it for free right now. Sit with that one.

The champ’s still in jail, his heirs are either in the same jail or running late, and the model in your downloads folder is doing your coding for a few dollars a million tokens and asking no one’s permission. Pick accordingly. And configure a fallback this time. I keep being the guy who learns that the hard way so you don’t have to.

The Obsidian Plugin Collection I Built One Free Kiro Credit at a Time

Wed, 24 Jun 2026 07:00:00 -0500

Here’s the problem with Claude Code on the Pro plan: it’s enough to get you started and never enough to get you finished. You sit down on a Saturday with a real idea, you get the agent into a groove, the code is actually taking shape, and then the session cap stops you.

So, once a month, I’d take one idea for an Obsidian and I’d buid it. Not on Claude. On Kiro, AWS’s agentic IDE, whose free tier resets on a monthly cycle. One idea, one credit, one plugin. Tweak the spec over a week or two of note-taking, then hand the whole thing to Kiro and let it grind through its task list.

I’ve written about Kiro before, back when it built me my first two plugins in a weekend. This is where the weekend turned into a habit that turned into a handful of plugins. Some of these are real and polished. Some are working but rough. And not all of them were even built completely with Kiro, but for the most part I have been turning a month of free Kiro credits into one Obsidian plugin.

The Free Tier Economy of AI Coding Tools
The Monthly Grind
The Plugin List
What the Process Actually Looks Like (a Kiro Horror Story)
The Real ROI

The Free Tier Economy of AI Coding Tools

I’m not paying $200 a month for a Max plan to ship a hobby plugin that twelve people will install. So I take advantage of free-tiers. Claude Code on Pro for the daytime stuff. Google’s AI Studio when you want a huge context window and want a quick opinion on an idea along with a rough plan before you start building something. Jules for fixing bugs in something that already exists. And Kiro for the building extensions, plugins, and the like.

The free tier is the hook: $0 a month, 50 credits, and, at the time I’m writing this, access to Claude Sonnet 4.5 inside it. The credits reset at the start of each billing month and you can use them up in one session.

Kiro’s whole pitch is spec-driven development. Instead of vibe-coding, it makes you write (or makes the agent write) a requirements doc, a design doc, and a tasks doc that breaks the build into a numbered checklist. Then it works the checklist. That structure is the entire reason I use it for Obsidian plugins specifically, and not for, say, a large web app.

Here’s why Obsidian plugins are the perfect shape for a free-tier:

Small surface area. It’s a TypeScript project with a manifest.json, a main.ts, and a documented API. There’s no backend, no auth, and Github Actions builds it for me.
Clear standards. Obsidian’s sample plugin repo and developer docs define exactly what “correct” looks like. An agent that follows documentation does well here.
You can ship before you’re done. BRAT lets anyone install a plugin straight from a GitHub repo, no community-store review required. So “done enough” is a real, useful state.

The Monthly Grind

I keep a running list of plugin ideas as plain notes. Over a week or two, the note stops being “wouldn’t it be cool if” and starts being an actual spec: what it does, what the settings are, where the edge cases live.

Then, when the note is ripe and the monthly credit is sitting there unspent, I open Kiro, point it at the notes, and tell it to build. Kiro turns that into its own internal task list. For Obsidian Cleaner, Kiro generated a tasks.md with 91 tasks. For the Notebook OCR plugin, it ended up spanning five separate specs with close to 280 tasks between them. The Daily Prompts build was 45. Even “small” Joplin Portal was 43 tasks in the main spec, before I made it add a 16-task spec later just to bring it up to Obsidian’s plugin guidelines.

I don’t read all 91 tasks. I read the design doc, I sanity-check the first few tasks, and then I let it run. When it finishes, I install the build into my test vault, poke at it, and sort it into one of three buckets: done messing with it, needs another pass, or back on the shelf. Most months land in the middle bucket. Some months produce a genuine mess I have to come back and fight with. It’s nice when the result works enough that I start using it all the time.

The Plugin List

This is everything that actually exists. If it’s installed in my test vault, I built it. Here’s the status of each one, including which ones Kiro built and which ones came from somewhere else.

Apple Books Annotation Import: the one that I started with

Repo: github.com/eristoddle/apple-books-annotation-import · Version: 1.0.22 · Status: working, most polished of the lot

Kiro had nothing to do with this one actually. It started life as a Python script that pulled annotations out of Apple Books’ SQLite databases, and I used Claude desktop to convert it into a plugin and Jules to fix the final bugs. It started the Obsidian plugin in an afternoon ritual.

It’s also the one that’s had the most love since. It pulls highlights and notes out of the macOS Books databases and turns them into formatted markdown: book metadata, ISBN, reading progress, highlight colors as emoji (🟡🟢🔵🟣🔴), chapter detection from EPUB CFI locations, the works. There’s an interactive picker so you choose exactly which books to import, with cover thumbnails and annotation counts. Turns out Apple’s annotation data is a labyrinth spread across separate SQLite files that aren’t meant to be touched. It’s macOS-only. I use it all the time.

Joplin Portal: the one Kiro broke, then fixed

Repo: github.com/eristoddle/joplin-portal · Version: 1.0.18 · Status: working, has a real test suite

Joplin Portal gives you a sidebar panel in Obsidian that searches, previews, and selectively imports notes from a running Joplin server through Joplin’s Data API. It’s not a hack or a file-format conversion. It talks to Joplin’s actual REST endpoint with a token, debounces the search, caches results, and lets you cherry-pick what comes across.

This is the plugin where I learned exactly where Kiro’s limits are, and it’s the best story in the bunch, so it gets its own section further down. Short version: Kiro built the working thing in 43 tasks, then I asked it to fix one cosmetic bug and it broke a lot of things in the process.

Obsidian Cleaner: the one that suffered scope creep

Repo: github.com/eristoddle/obsidian-cleaner · Version: 1.2.0 · Status: active, recently renamed

This one started as “Attachment Cleaner,” a single-purpose tool to find attachments nothing links to. Then I kept noticing other crap in my vault. Dropbox conflicted copies. Numbered duplicates from Web Clipper (Note 1.md, Note 2.md). Zero-byte markdown files. Empty folders. Tags that were one typo apart from each other. So I gave Kiro the bigger spec, watched it generate a 91-task checklist, and let it turn a one-trick tool into a seven-type vault-hygiene suite. Then I renamed the whole thing to Obsidian Cleaner because “Attachment Cleaner” no longer fit.

It walks you through a step-by-step modal: orphaned attachments, conflicted files, duplicates, empty markdown, empty folders, near-duplicate tags (edit distance ≤ 2), and a frontmatter-rule cleanup. Every single item gets its own accept/reject toggle before anything gets deleted. You pick the deletion mode too: system trash, Obsidian’s .trash, or permanent if you’re feeling brave. Started as a scalpel, ended as a Swiss Army knife. Scope creep, but the useful kind.

Daily Prompts: the one that’s almost there

Repo: github.com/eristoddle/obsidian-daily-note-prompts · Version: 1.0.1 · Status: mostly working, notifications still flaky

Daily writing prompts delivered into your daily note, three ways: sequential (for structured courses), random (no repeats until the pack’s exhausted), or date-based (for seasonal stuff). Prompts live in JSON packs you can import and export, so you can share a pack or back one up. It’ll create or open the daily note when a prompt fires, optionally drop you into a distraction-free mode, and ping you with either a system notification or an Obsidian notice.

That notification system is exactly where it’s still rough. Timezone-aware scheduling and “catch up on missed prompts” are great on paper, and the persistence is the part that needs another Kiro credit. It’s 45 tasks of mostly-done. One of these months it’ll get the pass that finishes it.

Notebook OCR: the one that worked well and I stopped using

Repo: github.com/eristoddle/obsidian-ocr-note-import · Version: 1.0.16 · Status: working

I carry a 3.5×5.5 field notebook everywhere, and for years the digital bridge was me typing my own handwriting back into Obsidian like an animal. This plugin kills that. You photograph the pages, it runs OCR (local Tesseract.js offline, or OpenAI Vision / Google Cloud Vision when you want it to actually read cursive), and then it routes the extracted text into your vault using regex pattern rules with capture groups. It even preprocesses multi-page notebook scans, splitting and rotating them before OCR, with presets for common pocket-notebook layouts.

This is the most over-engineered thing on the list (five Kiro specs, nearly 280 tasks total), and it already got its own full writeup.

YouTube Auto Video Summarizer: the one I forked

Repo: github.com/eristoddle/obsidian-auto-video-summarizer · Version: 1.2.3 · Status: forked and customized

Sometimes the best plugin is someone else’s that you make your own. This is a fork of mbramani’s video summarizer, which already pulls YouTube transcripts and summarizes them with an LLM: Gemini, OpenAI, or Anthropic, your key, your choice of model. What I added to my copy is the automatic part: it’ll summarize any YouTube URL you paste into the editor, or any new web clip whose source frontmatter is a YouTube link. Just a fork and a quick modification. Took about 15 minutes.

Tag Explorer 3D: the one that doesn’t have a git repo yet

Repo: none (never pushed) · Version: 1.0.0 · Status: built, never shipped

This lives in my test vault, rendering my hierarchical tags (#obsidian/plugin/idea) as an interactive 3D network graph with three.js and 3d-force-graph: orbit, zoom, hover-to-preview note contents, hierarchical and force-directed and radial layouts.

I could never decide whether a 3D tag graph is genuinely useful or just a very pretty answer to a question nobody asked. Sometimes you build things because they’re interesting, not because anyone needs them, and then you can’t quite bring yourself to delete them either.

What the Process Actually Looks Like (a Kiro Horror Story)

Let me show you a real month, with the mess included.

Joplin Portal worked. Kiro built the search, the preview, the import, all 43 tasks of it, and it was good. Then I noticed the panel’s icon wasn’t rendering right, which is a cosmetic nothing of a bug. So I asked Kiro to fix the icon.

This is where it went sideways. Instead of reading Obsidian’s icon documentation and using the one correct API call, Kiro started flailing. It wrote a registerJoplinIcon() function. The icon still didn’t render, so on the next pass it added a second function, registerJoplinIconEarly(), and called both. It started rendering image tags that looked like <img width="20" height="20" src="joplin-id:2f95b263..."/>, a made-up URL scheme that was never, ever going to resolve. It was guessing, and each guess piled another layer of garbage on top of the last one. The “fixing” produced a codebase that was worse than when it started.

So I did the thing you eventually learn to do with these tools: I stopped letting it improvise and put it on a leash. I went and did my own research, figured out the actual fix, and then wrote it a prompt that was less “please help” and more “here is your one job, do not deviate.”

This plugin needs to follow best practices. You have made it a mess trying to fix an issue with the icon rendering. See how many times you call this.registerJoplinIcon() and then the duplicate function this.registerJoplinIconEarly(). This is unnecessary, and whenever you catch yourself doing this, you should know you have taken a wrong turn. That was the only problem. Everything else worked and you have made a mess.

You MUST, MUST, MUST follow the documentation to do this right: https://docs.obsidian.md/Plugins/User+interface/Icons. Do not ever guess. Do not think you know unless you do. Do not make things up. You must also use this repo as reference for best practices: https://github.com/obsidianmd/obsidian-sample-plugin. And you must remember you are only fixing the mess with rendering the icon and bringing it back to standards. Nothing else.

Its confidence is uncorrelated with its correctness, and when it doesn’t know something it will invent something that looks plausible and ship it. The duplicate function with the Early suffix is the tell: when an agent starts bolting “early” and “v2” and “final” onto its own helpers, it’s not solving the problem, it’s papering over the fact that it doesn’t understand it. Your job isn’t to write the code. Your job is to notice the tell, hand it the authoritative documentation, and fence the task down to exactly one thing so it can’t wander.

After that prompt, it fixed the icon. Then I had it add a 16-task spec to bring the whole plugin in line with Obsidian’s official guidelines, and that’s why Joplin Portal is the one on this list with an actual Vitest test suite.

The Real ROI

So what does “one free Kiro credit a month” actually get you? Let’s add it all up.

It got me eight things that run: two genuinely polished, three that are working-but-rough, one fork, and one that’s never been pushed.

There’s friction too. You hit session caps. You switch tools and lose context between them. The agent sets your icon-rendering code on fire and you spend a Saturday afternoon being its very firm manager. Some months produce a finished tool and some months produce a mess.

The constraint makes you good at scoping. When your AI tool can only finish work that fits inside one month’s free allotment, you’re forced to break ideas down until they’re small enough to actually ship. And a plugin idea you’ve sharpened to that point is one you understand. The vault notes that become the Kiro specs are a documentation trail I’d never have written otherwise. The real product of this whole ritual is the discipline of cutting work down to a finishable size, which is the one skill that survives no matter which AI tool you’re using this month.

Model Buzz Roundup — Week of June 17, 2026

Tue, 23 Jun 2026 08:00:00 -0500

Here’s a fun way to start a Tuesday: open the leaderboards to pick a model for the week, find that the model sitting at number one on every single board is one you are not allowed to use, and not because you’re broke. Because the government said so.

That’s where we are. The best general-purpose language model on the planet right now, by both the crowd-vote board and the hard-benchmark board, is Claude Fable 5. It has been dark since June 12. Not deprecated. Not rate-limited. Switched off, for every customer worldwide including Anthropic’s own foreign-national employees, by a US export-control directive that landed in Anthropic’s inbox at 5:21pm Eastern. Eleven days later, still no lights.

So this week’s roundup is less “here are the shiny new toys” and more “here is what happens when the shiny new toy gets repossessed by Washington, the next two toys turn out to be vaporware, and the only thing nobody can take away from you is the open-weights model from China sitting in your downloads folder.” You know how that goes.

The Week the Government Unplugged the Number One Model
The Vaporware Twins: GPT-5.6 and Gemini 3.5 Pro
The Model Nobody Can Switch Off
Cheapskate Picks: Best You Can Actually Run
Horror Stories from the Wild
Where This Leaves You

The Week the Government Unplugged the Number One Model

Let me lay out the facts, because this one is wild enough that you’ll want the sources.

On June 12, the US government issued an export-control directive ordering Anthropic to suspend all access to Claude Fable 5 (the public model) and Claude Mythos 5 (the heavier sibling underneath it) for any foreign national, anywhere, inside or outside the US, employees included. Anthropic complied that evening and put out a statement saying so. Fortune and Al Jazeera both covered it, and it even got the dry legal-blog treatment from the National Law Review.

The reason, per officials: someone found a jailbreak that could bypass Fable 5’s safeguards and unlock the cybersecurity capabilities of Mythos sitting underneath. Anthropic’s position is that the jailbreak was narrow (one specific instance, not a universal skeleton key) and that this is a misunderstanding they’re working to clear up. Maybe. But “we think it’s a misunderstanding” doesn’t bring your production model back, and as of this writing the status page still says nothing.

Here’s the part that should make you sit up if you build things for a living. Everybody was already bracing for June 22, the day Fable 5 was supposed to drop out of the Pro/Max/Team subscription plans and move to credits-only at the full $10-in / $50-out API rate. People were planning migrations around that date. Then June 12 happened and made the whole conversation moot. The model didn’t leave on a billing schedule you could plan for. It left on a government directive nobody saw coming.

That’s the lesson, and it’s not “Anthropic bad.” It’s that a hosted frontier model is a dependency you do not control, and the failure mode is not always a bug or a price hike. Sometimes the failure mode is a federal agency. If your roadmap had “Fable 5” as a load-bearing assumption, your roadmap caught fire while you were asleep.

And just to twist the knife: the leaderboards still crown it. Fable 5 is number one on Arena Overall (1508), Coding (1563), Creative Writing (1500), Math (1517), and Instruction Following (1517), plus number one on Artificial Analysis’s Intelligence Index (60). The votes are banked. The benchmarks are real. The model is a ghost. Every “best model” recommendation this week quietly gets an asterisk: of the ones you can actually call.

The Vaporware Twins: GPT-5.6 and Gemini 3.5 Pro

Okay, so the champ is in jail. What’s the West got coming off the bench? Two models that don’t exist yet, depending on how generous you’re feeling about the word “exist.”

GPT-5.6. As of June 23, OpenAI has announced nothing. No model card, no benchmarks, no pricing, no date. What we do have is a model ID showing up in Codex routing logs and some unannounced A/B testing on paid accounts. Reporting from June 19 had GPT-5.5 Pro users getting served what felt like a different model: sometimes better output on web design and 3D work, but tasks that used to finish in about 10 minutes suddenly taking 20 to 60. The rumor sheet says 1.5M-token context and a reworked alignment pipeline. The actual evidence is a string in a log file and an 83-90% Polymarket bet that it ships sometime June 22 to 28. We are, collectively, grading leaks of leaks.

Gemini 3.5 Pro. This one’s been “coming” since Google I/O on May 19, where Pichai told the room “give us until next month” and the developers reportedly groaned out loud. Next month is now, and as of June 19 it’s still locked in limited Vertex preview for select enterprise accounts. Not in the app, not in AI Studio, not on a consumer plan. It targets a 2M-token context and Deep Think reasoning, and it’s expected to land around $15-in / $60-per-million-out when it finally shows. Expected. When. Finally.

Here’s the thing: you can’t build on a changelog entry. You can’t ship a feature on top of a model that’s a rumor with a betting line, or one that’s stuck behind an enterprise preview gate. Both of these might be genuinely great. Neither of them is something you can pip-install into your week.

So what do you actually run right now? Not these two. The real progress has been shipping from somewhere else the whole time.

The Model Nobody Can Switch Off

While the West was teasing, China shipped. The headline release of the period is GLM-5.2 from Z.ai, out mid-June with open MIT-licensed weights, a usable 1M-token context, and pricing of $0.98-in / $3.08-out per million tokens. VentureBeat clocked it beating GPT-5.5 on SWE-bench Pro (62.1% vs 58.6%) at roughly one-sixth the cost. Those are vendor-leaning numbers and it launched without a full eval card, so trust but verify; the independent backstop is right there: GLM-5.2 sits at number six on Artificial Analysis’s Intelligence Index (51), the highest open-weights model on the board, parked on the value frontier. Two independent methodologies, same model. That’s about as confident as a recommendation gets.

And GLM-5.2 isn’t a fluke, it’s a trend with receipts. On OpenRouter, DeepSeek alone is running about 17.6% of all platform token volume (more than Google at 12.5% or OpenAI at 8.4%), and Chinese-origin models are somewhere between 46% and 61% of everything flowing through the platform depending on whose cut you read. DeepSeek V4 Pro ($0.435 / $0.87) and V4 Flash ($0.14 / $0.28) are open-weight, MIT-licensed, 1M-context, and permanently discounted. The volume isn’t going to the flashy closed frontier. It’s going to the cheap stuff you can self-host.

Connect the dots from the last three sections and the throughline isn’t really about price. It’s about control. The frontier model got revoked by directive. The next two flagships are gated behind previews and prediction markets. And the models eating the actual usage are the ones where you can download the weights and run them on hardware you own, where no directive, no billing cliff, and no enterprise-preview waitlist can touch them. The cheapskate argument and the geopolitics argument landed on the exact same advice this week: own your weights.

Cheapskate Picks: Best You Can Actually Run

The method here is simple. Take the Arena leader in each category, draw a band of 50 rating points below it, and find the cheapest model in that band, because Arena’s top end is so compressed that paying 10x more usually buys you a sub-3% rating bump. The wrinkle this week: the leader in five of six categories is Fable 5, which is suspended, so each pick also names the cheapest thing you can actually call, anchored to that (unusable) leader’s rating. Output price per million tokens, because output dominates real workloads.

Category	Leader (status)	$ out	Cheapskate pick	$ out	Δ rating	Cheaper by	AA value frontier
Overall	Fable 5 (SUSPENDED)	$50	GLM-5.1	$3.08	−33	~16x	yes (via GLM-5.2 #6)
Coding	Fable 5 (SUSPENDED)	$50	GLM-5.1 / GLM-5.2	$3.08	−34	~16x	yes
Math	Fable 5 (SUSPENDED)	$50	Qwen3.7 Max	$3.75	−25	~13x	nearby
Creative Writing	Fable 5 (SUSPENDED)	$50	GLM-5.1 (Gemini 3.5 Flash safer)	$3.08	−38	~16x	nearby
Instruction Following	Fable 5 (SUSPENDED)	$50	Gemini 3.1 Pro	$12	−35	~4x	nearby
Hard Prompts	Opus 4.6-thinking	$25	Gemini 3.1 Pro	$12	−25	~2x	nearby

Here’s what the table is actually saying:

Coding is the slam dunk. GLM-5.1 (Arena 1529) and GLM-5.2 (1526) both land at $3.08 output and both beat Claude Sonnet 4.6 (1527, $15) on price and rating. GLM-5.2 throws in the 1M context and the open weights. Cheapskate methodology and AA’s value frontier agree: highest-confidence pick of the week.

Math has a weird side effect from the shutdown. With Fable 5 gone, Gemini 3.5 Flash (1516, $9) is now effectively the top usable math model on Arena, ahead of every Opus thinking variant. If you want the absolute floor, Qwen3.7 Max (1492, $3.75) is cheaper still.

Creative writing is trickier. GLM-5.1 is the cheapest in band but it’s stylistically thin for prose. Gemini 3.5 Flash at $9 is the no-regrets play if words-as-product matter.

Instruction Following has no cheap answer. Nothing under $10 lives in that band. Gemini 3.1 Pro at $12 is the value floor, and I’m flagging that honestly rather than pretending there’s a steal where there isn’t. Sometimes you’re just paying for quality.

Hard Prompts is the one category where the leader is actually available — the Opus line was not suspended, only Fable and Mythos. Opus 4.6-thinking (1533, $25) leads; Gemini 3.1 Pro (1508, $12) gets you within spitting distance for half the money.

The boring-but-correct summary: if you’re not doing something that genuinely needs a frontier model, GLM-5.1/5.2 and Gemini 3.5 Flash cover most of your week for single-digit dollars per million tokens, and at least one of them you can run on your own iron.

Horror Stories from the Wild

Your production model, revoked at 5:21pm by directive. I already told this one up top, but it belongs here too, because this is the actual nightmare. If you shipped a product on Fable 5 during its brief public life, June 12 was the day it disappeared with zero notice. Not a deprecation email. A federal export-control order. For every customer. The lesson isn’t a vendor grudge; it’s that “the model is hosted by a responsible lab” is not the same as “the model is under my control.” (Anthropic’s statement, Fortune)

Silently A/B-tested into a slower model. Per Decrypt’s June 19 report, GPT-5.5 Pro subscribers got quietly routed into something that behaved differently: occasionally better output, but tasks that wrapped in ~10 minutes stretching to 20-60+ with no announcement, no opt-in, no changelog line. Whether or not that’s GPT-5.6 wearing a trench coat, the horror is identical: your latency profile changed overnight because a lab was canary-testing on your live traffic. If your app has a timeout budget, that’s not a curiosity, that’s an incident.

Where This Leaves You

I’ll be honest, I came into this week expecting to write about GPT-5.6 benchmarks and instead wrote about a government switching off the best model on Earth. That’s the genre now. The frontier is real, it’s moving fast, and it is also increasingly something that can be yanked, gated, repriced, or quietly swapped under you while you sleep.

So here’s the unglamorous takeaway, no inspiration porn attached. For the work that genuinely needs the absolute top of the stack, fine, pay for it — but architect like it can vanish, because this week it literally did. For everything else, which is most things, the open-weights stuff has gotten good enough and cheap enough that reaching for it isn’t settling. GLM-5.2 is the number six model in the world right now and you can download it. Let that sink in.

The champ’s in jail, the heirs are vaporware, and the model in your downloads folder is quietly doing your coding for three bucks a million tokens. Pick accordingly. And maybe keep a fallback configured this time. I’ll be the guy who learned that the hard way so you don’t have to.

The Agent Skills Guide I Wish I'd Had

Wed, 17 Jun 2026 07:00:00 -0500

I’ve got over a dozen projects in many different states of “done,” and for a long time every single one of them started with me typing the same context tax into a fresh session. CLAUDE.md helped. But CLAUDE.md loads everything, every session, whether the task needs it or not. What I actually wanted was a way to hand the agent the right knowledge at the right moment and pay nothing for it the rest of the time.

That’s a skill. And after a few months of building them for real (a skill that researches model trends and accidentally turned into its own weekly blog post and a few that do boring research so I don’t have to), I’ve got opinions. This is the guide I wish someone had handed me. It’s Claude Code first, because that’s my preferred driver, but every other coding agent I’ve considered using gets its own section near the end, quirks and all.

What a Skill Actually Is (and the Three Things It Isn’t)
The Mistake Everyone Makes First: Treating It Like a Shell Script
The Part That Saves You Tokens: How Claude Code Loads Skills
Building Your First Skill in Claude Code
The Folder Is the Feature
- A Claude Code-specific trick: skill-scoped hooks
Skills Actually Worth Building
The Other Guys: Skills Everywhere Else
How Good Skills Actually Get Built
Organize Before It Becomes a Swamp
The Skills I Actually Reach For
Lessons I Had to Learn the Hard Way
- Don’t trust a skill’s first trial. Build a way to catch the ones that rot
- Revisit your descriptions. Treat global ones completely differently
The Honest Version

What a Skill Actually Is (and the Three Things It Isn’t)

A skill is a folder. At minimum it’s one file, SKILL.md, with a little YAML frontmatter on top and plain markdown instructions underneath. The agent reads the frontmatter at startup, decides on its own whether the skill is relevant to what you’re doing, and pulls in the full thing only when it is.

That last part is the whole point: a skill is conditionally loaded context. It sleeps in an index until the model decides it matters, then wakes up. That makes it different from the three things people constantly confuse it with.

A prompt is something you type. It lives for one turn and dies when the session ends.

A CLAUDE.md (or AGENTS.md, the more portable name a lot of tools now read) is always-on. It’s the employee handbook: team conventions, non-negotiable standards, the stuff that should apply to everything. The problem is that you pay for every line of it on every single turn, whether you’re touching the billing code or fixing a typo. Cram domain knowledge in there and you’re burning context to tell the model about your payment state machine while it edits your README.

A slash command is a shortcut you fire manually. You decide when it runs.

A skill is the specialist manual that comes off the shelf only when the job calls for it. The model decides when (or you can call it like a slash command). CLAUDE.md is “here’s how we do things here.” A skill is “here’s the thing you’d otherwise get wrong, and only when you’re about to get it wrong.”

The Mistake Everyone Makes First: Treating It Like a Shell Script

A skill is not a batch file. An LLM is not a command executor. It’s a probabilistic model that reads your instructions and decides what to do. There is no guarantee your steps run in order. There is no guarantee every line gets followed. If you write a skill as a numbered list of shell commands, you haven’t written a skill. You’ve written documentation that will fail in surprising ways the first time reality doesn’t match your happy path.

Think of it like directing instead of programming. The model is the talent. It can act, it has instincts, it’s done this before. Your skill is the shot list and the blocking notes for this specific scene. You don’t tell a good actor which muscles to move. You give them motivation, constraints, and the things they can’t know on their own, then you let them perform.

So don’t write this:

git checkout main
git checkout -b fix-branch
git cherry-pick <sha>
git push origin fix-branch

Write this:

Cherry-pick the commit onto a clean branch off main. Resolve conflicts by preserving the original intent of the change. If it can’t land cleanly, stop and explain why instead of forcing it.

The second version works better precisely because it gives the model room to handle the mess. The newer and more capable the model, the truer this gets. A smart model will interpret your rigid steps and quietly do something better. Or worse, get confused trying to follow a script that no longer fits the situation. Give it judgment criteria. Let it execute.

The Part That Saves You Tokens: How Claude Code Loads Skills

Skills load in three stages, and understanding this is the difference between a skill that stays cheap and one that taxes every turn of your session.

Stage	What loads	Roughly what it costs	When you pay
Index	Just the `name` and `description` from the frontmatter	A handful of tokens per skill	Every session, always
Body	The full `SKILL.md`	A few hundred lines, ideally	When the agent decides the skill applies
Runtime	Files in `references/`, `scripts/`, `assets/`	Effectively unlimited	Only when the agent actually opens them

The index is paid by everyone, every session, forever. Every skill you have installed contributes its name and description to a list the agent scans at startup. This is why the description has to be tight: every character burns tokens on every session, including the ones where the skill never fires.

The body is paid once the skill triggers, and then it sits in context until the session ends or hits a compaction boundary. Load five fat skills in one session and you’re carrying all five bodies the whole way. A skill stuffed with fluff doesn’t just hurt itself. It degrades every other skill loaded next to it.

The runtime files are basically free until needed. This is where the heavy stuff goes: full API references, error-code tables, the long boring rules nobody needs most of the time. The agent reads them on demand, and only the parts it needs.

Get this right and a skill stays dormant and cheap until it earns its place. Get it wrong (everything jammed into one giant SKILL.md) and you pay full price even when the task needed ten percent of it. I’ve seen the real-world version of this: a bloated monolithic skill restructured into a thin spine pointing at a few reference files dropped its context cost by roughly three times with zero change to the actual instructions. Same words, but a different shape. Three times cheaper.

Building Your First Skill in Claude Code

Don’t start with a document. Start with the thinnest thing that helps. In Claude Code, skills live in two obvious places:

# Personal: follows you across every project
~/.claude/skills/your-skill-name/SKILL.md

# Project: checked into the repo, everyone on it gets the skill
.claude/skills/your-skill-name/SKILL.md

Start personal. Break things where nobody’s watching. Promote to the project repo once it actually works. And the minimum viable skill is genuinely this small:

---
name: react-component-conventions
description: Load when building or modifying React components, referencing MUI components, or implementing our design system patterns.
---

## What this provides

Our components use MUI as the base. Files go in `src/components/` organized
by domain, not by type. Props interfaces live in the same file as the component.

## Gotchas

- Don't use the `sx` prop for styles reused across components. Extract a styled component instead
- Always pull theme values through `useAppTheme()`, never direct MUI theme imports
- Forms use react-hook-form with our `FormField` wrapper. Don't hand-roll form state

Two sections. That’s it. You’ll grow it later, and you’ll grow it from failures, not from imagination.

The description is the hardest line you’ll write

The description isn’t a summary. It’s a routing trigger. It’s the one thing the agent sees in that always-loaded index, and it alone decides whether your skill loads. And whether it wrongly loads during unrelated tasks and contaminates them.

A bad description describes the skill’s contents. A good description describes the user’s state of mind when they need it:

Bad	Good
“This skill helps with our billing library”	“Load when working with billing-lib, subscription states, or invoice generation. Covers the edge cases and footguns.”
“Deployment workflow docs”	“Load when the user says ‘babysit the PR’, ‘watch CI’, ‘make sure this lands’, or ‘deploy the service’.”

Write it from the user’s perspective, keep it short, and don’t summarize the workflow. One sloppy description doesn’t just make your skill miss. It makes the whole shelf noisier, because now your skill barges into tasks it has no business in. Every skill you add risks making every other skill slightly less accurate. The description is where you control that.

Let Claude write it, then cut hard

Do a real task in Claude Code, manually feeding it all the context you’d normally re-explain. Notice what you repeated. At the end, tell it: “Write a skill that captures the pattern we just used. Focus on the knowledge I gave you, not the stuff you already knew. Keep it under 200 lines.” Then cut the result aggressively (first drafts always over-explain) and test it in a fresh session with zero carryover.

If you want the structured version of this, Claude Code ships skill-creator. Invoke it with /skill-creator and it interviews you, writes a draft, runs your test cases with and without the skill side by side so you can actually see what the skill buys you, and even tunes the description against should-trigger and should-not-trigger queries. It’s overkill for a quick library-reference skill. It’s exactly right for anything going into wide use. And it will use a lot of tokens. I created only a handful of skills using it so far and the most complex one took a complete Pro session to finish.

The gotchas section is the whole game

After it ships, a skill barely changes in the body. It evolves through gotchas. Agent does something dumb because its sane default doesn’t match your weird environment? Add a gotcha. One line, “Always run the build from the repo root, never from inside a module”, kills a class of error forever.

The Folder Is the Feature

The one-file skill is fine to start. But the reason skills beat a giant CLAUDE.md is the folder:

your-skill/
├── SKILL.md          ← the hub: frontmatter + core instructions
├── references/       ← heavy docs, read only when needed
│   ├── api.md
│   └── error-codes.md
├── scripts/          ← code the agent runs, not rewrites
│   └── validate.py
└── assets/           ← templates and output shapes
    └── pr-template.md

The rule that keeps this sane: one hop from SKILL.md to anything. The hub points directly at references/api.md. One hop. The hub pointing at a file that points at another file that finally has the content? That’s three hops, and the agent will half-read the chain, lose the thread, and miss things. Keep it flat. Progressive disclosure, not a hierarchy for its own sake.

Each folder has a job:

references/: documentation too long for the body. API tables, error codes, domain rules that run pages. Put a table of contents at the top of anything over ~100 lines so the agent can jump instead of reading the whole thing.
scripts/: deterministic code you want run, not reinvented. Here’s the quiet efficiency win: when a script runs, only its output enters the context window, not its source. You can parse a file, hit an API, or run a whole validation suite and pay only for the result. Make the failure messages specific. “Field customer_name not found. Available: account_id, order_total” lets the agent self-correct. “Validation failed” makes it guess.
assets/: templates and locked-down output shapes. PR descriptions, report formats.

And the antipatterns that bite everyone:

Frontmatter on a reference file. Frontmatter is what gets promoted to the always-loaded index. Put name: and description: on a reference file and you’ve just made it a top-level skill the agent can trigger without the parent that gives it context. Strip frontmatter from everything except the root SKILL.md.
Hardcoded paths. cd modules/web works on your machine. Your teammate’s repo has packages/frontend/web. Tell the agent to discover the path: “find the directory with the frontend package.json.”
One monolithic file. Already covered the token math. Don’t.

A Claude Code-specific trick: skill-scoped hooks

This is where Claude Code pulls ahead of most of the field. You can define hooks in a skill’s frontmatter, and they’re only active while that skill is active. A security skill can register a PreToolUse hook that inspects Bash commands and blocks rm -rf. A deployment skill can attach a PostToolUse hook that reminds the model to run the verification script after touching release files. The rule lives and dies with the skill instead of polluting every session. Most other agents can’t do this yet. They lean on bundled scripts and always-on instructions instead, which I’ll get to.

Skills Actually Worth Building

I won’t list every category. After watching how teams and solo builders use these, a few earn their keep more than the rest:

Verification skills are the best return on the list. These teach the agent to check its own work: a Playwright script that walks your signup flow with assertions at each step, or a checker that validates responses against your OpenAPI spec. The power is the loop: the agent runs the check, sees the failure, fixes it, and re-runs, all in one task. Without one, it writes code and ships it optimistically and you find the bug. The Claude Code team has noted that the investment in verification skills pays out disproportionately, and that matches my experience.

Library and API reference skills handle the internal stuff the model can’t know: your billing library’s edge cases, your migration patterns, the navigation component you wrote that shares a name with nothing public. The core docs are table stakes here. The gotchas are the value.

Scaffolding skills generate boilerplate that’s already shaped right: a new endpoint pre-wired to your architecture, a component shell with your styling conventions baked in.

Runbook skills map a symptom or an error signature to a structured investigation. Gold for debugging and on-call.

Onboarding skills turn the docs you already have into something the agent can actually use. Cheap to build, more useful than you’d guess, because the raw material already exists. It just needs packaging.

If you want a feel for how far this scales, I run a weekly model-trends blog post off a single skill now. It started as a thing to save me money on research and quietly became infrastructure. And now it runs automatically every Tuesday morning at 7 AM and emails me when it’s done, so I can edit and publish it.

The Other Guys: Skills Everywhere Else

The SKILL.md format is an open standard now. The same skill folder, untouched, works across a growing list of agents. What changes between tools is two things: where the skill files live, and how the tool behaves once it loads one. So this section is organized exactly that way: for each tool, where you put it, and the quirk that’ll trip you up. One piece of good news up front: a shared install location, .agents/skills, has started to emerge as the neutral ground several of these tools quietly agree on. I’ll come back to why that matters when we get to organizing the mess.

The open-source agents

If you’ve only used Claude Code, you’ve missed that the open-source side of this has gotten genuinely good. I run open models through OpenRouter (DeepSeek, Qwen Coder, GLM and friends) partly to lean less on Claude for everything, and the agents below are how I drive them. Worth being clear about one thing the marketing blurs: DeepSeek and Qwen are models, not agents. You don’t write skills “for DeepSeek.” You point one of these open agents at a DeepSeek or Qwen endpoint and it loads the skills. Keep that straight and the whole map makes more sense.

OpenCode is the one that’s eaten the open-source world. By mid-2026 it’s the most-starred open coding agent by a wide margin. It’s a terminal agent, provider-agnostic (cloud APIs, OpenRouter, local models via Ollama), and it supports the skill standard natively through a skill tool the agent calls on demand. The quirk worth knowing: it’s promiscuous about where it reads skills from. Project skills go in .opencode/skills/, personal ones in ~/.config/opencode/skills/, but it also reads .claude/skills/ and .agents/skills/, both project and global. Which means if you already have Claude Code skills, OpenCode will often just find and use them with zero porting. I’ve been running it on smaller projects and that cross-reading is a quietly great feature.

Pi (the pi-mono toolkit) is Mario Zechner’s harness and it’s the engine underneath my OpenClaw setup, which is where most of my open-model tinkering actually happens. The whole pitch is subtraction: a four-tool core (read, write, edit, bash), a system prompt under a thousand tokens, MIT-licensed TypeScript, and a refusal to bolt on features just because everyone else has. He built it as a reaction to Claude Code getting heavier, which is its own kind of funny given this whole post. Skills fit that minimalist philosophy perfectly. Pi lazy-loads them on demand, and its skills are deliberately cross-compatible with Claude Code and Codex, so the same SKILL.md I wrote for my daily driver runs unmodified inside the thing answering my Telegram messages. That’s the open standard doing exactly what it promised.

Aider is the terminal OG, around since 2023, and still the gold standard if you live on the command line. Git is a first-class citizen. It stages changes and writes commit messages for you. Its context convention predates the skill standard: it leans on a CONVENTIONS.md you pass in as a read-only file, which is closer to an always-on instructions file than to on-demand skills. Different philosophy, same goal.

Cline is the top pick for VS Code people who want the editor-integrated experience instead of a terminal. Strong multi-file reasoning, and it reads the skill standard (still an experimental, opt-in feature you flip on in settings), picking up skills from .cline/skills/, .clinerules/skills/, and .claude/skills/.

Goose, from Block, is the more autonomous of the bunch. It plans, executes, and iterates with less hand-holding, and it’s built around extensions and custom tools.

Gemini CLI is Google’s open-source (Apache-licensed) terminal agent. It speaks the SKILL.md standard natively now: drop skills in .gemini/skills/ (project) or ~/.gemini/skills/ (personal) and it injects their name and description at session start, then calls an activate_skill tool when a task matches. The same SKILL.md you wrote for Claude Code runs here unmodified. Its always-on instruction file is GEMINI.md: same idea as CLAUDE.md, different filename.

Codex CLI (OpenAI)

OpenAI’s terminal agent. It implements the skill standard and behaves a lot like Claude Code: progressive disclosure, description-matched loading on demand. The wrinkle is that Codex adds its own metadata layer: alongside SKILL.md you can drop an agents/openai.yaml file for UI metadata, invocation policy, and tool dependencies. Skills live in .agents/skills/, which, not coincidentally, is one of the same portable locations OpenCode reads, so a skill dropped there is visible to both. Its always-on instruction file is the portable AGENTS.md, which OpenAI pushed hard as a cross-tool convention. So a skill written for Claude Code mostly drops in; you’re adding Codex’s metadata file, not rewriting anything.

Cursor

Cursor took the longest to come around, but it’s here now: Cursor natively supports SKILL.md as an open standard, and the same skill folder you wrote for Claude Code drops in untouched. It reads skills from .cursor/skills/ and .agents/skills/ at the project level, ~/.cursor/skills/ and ~/.agents/skills/ globally, and for backward compatibility it also picks up .claude/skills/ and .codex/skills/. It walks the skills root recursively too, so a SKILL.md nested deeper in a repo gets scoped to its containing folder automatically. Frontmatter is the familiar name and description, plus optional paths globs for file-scoping and disable-model-invocation for a skill that only fires when you call it by name. It also reads AGENTS.md for always-on instructions.

The history is worth knowing, because it’s what you’ll still find in older Cursor projects. Cursor’s context system grew up around rules, not skills: the .cursor/rules/ directory full of .mdc files (a plain .md in there gets ignored, because rules need frontmatter). Rules can be always-on, auto-attached by file glob, or pulled in on demand via their description, which always gave you some of the conditional-loading behavior skills have. Rules still work, but skills are the forward path now, and Cursor ships a /migrate-to-skills command that converts existing rules and slash commands over. Starting fresh, author skills. Sitting on a pile of rules, migrate them.

GitHub Copilot

The one with the home-field advantage if your work already lives on GitHub. The distinguishing thing about Copilot’s skills isn’t the format. It’s the reach. The same SKILL.md works across the whole Copilot surface: the cloud agent, code review, the Copilot CLI, the desktop app, and VS Code’s agent mode. Write a skill once and it shows up everywhere Copilot does.

On storage, Copilot is the most catholic of the bunch. The GitHub-native default is .github/skills/, but it also reads .claude/skills/ and .agents/skills/ at the repo level, with personal skills in ~/.copilot/skills/ or ~/.agents/skills/. So between Copilot, Codex, and OpenCode all reading .agents/skills/, that folder really is becoming the lingua franca. Distribution has a native path too: gh skill discovers and installs skills straight from GitHub repositories, which is the least-surprising workflow for a team that already does everything through GitHub. Org- and enterprise-wide skill scopes are still labeled “coming soon,” so for now you’re working with personal and project.

The quirk to know: Copilot doesn’t do the skill-scoped hooks trick Claude Code does. There’s no per-skill lifecycle hook you can register from frontmatter. For deterministic behavior you lean on bundled scripts, allowed-tools pre-approval for trusted commands, and always-on rules in copilot-instructions.md (or AGENTS.md). It’s not a dealbreaker. It just means the “block this command before it runs” pattern lives somewhere other than the skill itself.

The commercial top three

If you’re picking a paid agent and money’s the deciding factor, the field really narrows to three:

Claude Code (Anthropic): the most mature skill system, full stop. Skill-scoped hooks, /skill-creator, the cleanest progressive-disclosure model. It’s my daily driver for a reason, even as I question that habit out loud sometimes.
Cursor: the best editor-native experience if you want your agent inside the IDE rather than a terminal. It now natively supports SKILL.md (it used to be rules-first), so your skills travel here too, with a /migrate-to-skills command for older rule setups.
GitHub Copilot: if your team already lives in GitHub, it implements agent skills through VS Code and slots into existing repo workflows with the least friction.

The skill cheat sheet

Same SKILL.md, different mailboxes:

Tool	Where it reads skills	Worth knowing
Claude Code	`.claude/skills/`	The reference implementation
Codex	`.agents/skills/`	Optional `agents/openai.yaml` for metadata
OpenCode	`.opencode/skills/`, `.claude/skills/`, `.agents/skills/`	Cross-reads your Claude Code skills
Copilot	`.github/skills/`, `.claude/skills/`, `.agents/skills/`	`gh skill` to install from repos
Cursor	`.cursor/skills/`, `.agents/skills/`, `.claude/skills/`	Older `.cursor/rules/` still works; `/migrate-to-skills` moves you off it
Gemini CLI	`.gemini/skills/`	Same `SKILL.md`, runs unmodified

For always-on instructions, the names are CLAUDE.md, GEMINI.md, or the increasingly universal AGENTS.md. Write to the standard, keep your paths discoverable, and most of your skills travel for free.

How Good Skills Actually Get Built

The instinct you have to fight is opening an editor and documenting a skill before you’ve watched the agent fail without it. The right order is backwards from that:

Run the agent without the skill on three to five realistic tasks.
Write down exactly where it fails or assumes wrong.
Turn those failures into evaluations: what the agent should do, and crucially what it should not do. Negative examples are often worth more than positive ones.
Write the minimal skill that makes those evals pass.
Ship.

Starting from observed failures is the only thing that stops you from over-building. Most first drafts explain things the model already nails and skip the one gotcha that actually mattered.

Two more things I learned the slow way. Test across at least two model families before you trust a skill. A skill tuned on one model is calibrated to that model’s behavior, not just its raw capability. And a smarter model will often interpret your instructions more literally, not less. A writing skill told to “write short sentences” produced clean rhythmic prose on one model and choppy, mechanical garbage on the upgrade, because the better model applied the rule to every sentence regardless of feel. Keep a golden set of three or four prompts and re-run them on every model bump.

Organize Before It Becomes a Swamp

Here’s the lesson nobody puts in the getting-started guide, and it’s the one I’d undo the most damage by knowing earlier: decide where skills live before you have a pile of them. If you don’t, you end up with the same skill in three places, slightly different in each, and a context window quietly contaminated by near-duplicates that fight each other. Then you spend an afternoon ripping skills out of scattered folders trying to remember which copy was the good one. Ask me how I know.

After enough of that, I landed on two tools and one rule.

Skillshare for the global stuff. Skillshare keeps a single source of skills in ~/.config/skillshare/skills and syncs them out to every AI CLI I use (Claude Code, Codex, Cursor, and the rest) so one skill follows me into every repo without me copying anything. Everything I put up there is genuinely global: tools I want available no matter what I’m working on, regardless of the project’s language or stack. There’s no per-project flavor to them, which is exactly why they can be global without causing trouble.

APM for the project-level stuff. Microsoft’s APM, the Agent Package Manager, treats agent context like dependencies in a manifest: skills, instructions, hooks, MCP servers, the whole pile, declared once and installed per repo. It’s package.json for your agent setup. I dug into how I use it in my piece on AI-native engineering, which is also where I first ran into the .agents/skills convention. APM installs standard skills there for the tools that speak it (Copilot, Cursor, OpenCode, Codex, Gemini, Claude). It’s not a universal standard with everything behind it, but enough tools follow it now that it’s become the practical neutral ground.

The rule that ties them together: global is for skills with no project opinion; project-level is for skills that do. This sounds obvious until you hit the case that forces it. One project enforces a strict, functional TypeScript style with a particular set of lint rules; another is an older codebase with completely different conventions. If I made a “typescript-conventions” skill global, it would fire in both and be wrong in one of them every time. And worse, it’d sit in the index polluting every session, including the Python ones. Project-level via APM means each repo gets exactly the conventions it wants and nothing it doesn’t. The context window only ever sees the skills that repo actually needs.

That’s the whole game with organization: keep the global shelf small and opinion-free, push everything project-specific down into the repo, and you never end up with two slightly-different skills quietly arguing inside the same context window.

The Skills I Actually Reach For

Since I’ve spent this whole post telling you to build skills, here’s what’s actually on my shelf. The ones I wrote myself (or had an agent write for me, then cut down hard) I’ve packaged some of into a public repo so you can use them. Each one is standalone, so grab just the folder you want:

vault-writer: adds and updates notes in my Obsidian vault following my templates and conventions, so I’m not hand-formatting frontmatter at midnight.
blog-idea-scorer: scans the vault for half-formed post ideas, scores them by how ripe they are: material on hand, draft progress, recency
feature-story-research: mines a project’s finished work and my session logs to assemble the narrative material and outline for a “how I built this” post.
model-buzz-roundup: the one that accidentally became a newsletter. Researches what’s hot in open models and drafts the roundup.
fetch-anything: a wrapper that refuses to give up on a web page. It doesn’t do the fetching itself; it escalates through a stack of underlying fetch tools until one of them actually returns the content, and it hands back clean markdown instead of a soup of HTML. It exists because “the page blocked me” is not an acceptable answer when I know the content is right there.
modular-skill-creator: builds a skill as a lazy-loading router instead of one fat file: a thin SKILL.md that delegates to focused sub-workflows. Basically the folder-is-the-feature idea from earlier, turned into a tool.
verbalized-sampling: my skill version of the Verbalized Sampling technique (paper here). Instead of taking the model’s single safest answer, it asks for several candidates with explicit probabilities, which sidesteps the mode collapse that makes AI brainstorms so depressingly samey. My go-to when I want real options, not the most-likely one.
skill-hardener: the one I’m giving its own section below

I’m leaving a couple of my best ones off this list on purpose. Some things stay in the nest. And a few favorites I didn’t write but reach for constantly:

skill-creator: Anthropic’s own. The structured way to build and benchmark a skill, baseline runs and all.
frontend-design: when I want a UI that doesn’t look like every other AI-generated dashboard.
playwright-cli: drives a real browser for verification skills and end-to-end checks and I’ve had more luck with it than a Playwright MCP.
obsidian-cli and obsidian-markdown: the difference between an agent that thinks it knows Obsidian-flavored markdown and one that actually does.
ce-gemini-imagegen: generates and edits images through the Gemini image API. I cherry-picked this one out of the compound-engineering plugin; it does the text-to-image and image-editing work I’d otherwise leave a tab open for.
context-retrospective: analyzes an agent session after the fact to spot where my context and guidance need work. I honestly forget where I picked it up, which tells you how casually these things pile up.
adhd: spins up parallel idea branches under different cognitive frames (the biologist, the speedrunner, the ten-year-old, the zero-budget version), scores them, and prunes the dead ends. Not mine, but it’s the creativity hack I reach for when I’m stuck on something.

Lessons I Had to Learn the Hard Way

The how-it-works stuff above you can find in any decent guide. These next two I had to earn, and they’re the ones worth a sticky note on your monitor.

Don’t trust a skill’s first trial. Build a way to catch the ones that rot

A skill that passed its evals on Tuesday is not a skill that’s still good in a month. Models change under you, your other skills shift the context around it, and a description that routed perfectly starts mis-firing once you’ve added ten neighbors. The first trial is the beginning of trust, not the end of it.

I’d been chewing on this problem for a while (“how do I notice when a skill quietly stops pulling its weight?”), and then a /insights run suggested almost the exact same thing back to me, so I went ahead and built skill-hardener. It mines my recent session transcripts for recurring failure patterns, traces each one back to the skill responsible, and hardens that skill with a targeted fix plus a regression test so the same failure can’t sneak back in. Evals are how you prove a skill works before you ship it; skill-hardener is the regression suite that proves it still works after the world moved. The two aren’t the same job, and you want both.

It’s the kind of tool you don’t know you’ve been missing until your skills start silently degrading and you have no system for catching it.

I’ll be honest, though: I don’t think a manual-trigger regression skill is the final shape of this for me. I’ve got a bigger thing brewing: a local, always-on context layer that watches where my skills produce output I end up fixing by hand, and improves them straight from that telemetry instead of waiting for me to run a check. If that works the way I think it will, it makes skill-hardener mostly redundant. Which is fine. The best tools earn their own replacements. (More on that one another day, once it exists outside my notes.)

Revisit your descriptions. Treat global ones completely differently

The description is the routing trigger, and it’s also the thing that drifts most as your shelf grows. Re-read them periodically. But here’s the split I didn’t expect to land on: I tune project-level and global descriptions in opposite directions.

Project-level descriptions I push toward maximum (rich trigger language, lots of “load when” phrasing) because in a single repo I want the relevant skills firing automatically the moment they’re relevant. The blast radius is one project, so aggressive auto-loading is a feature.

Global descriptions I shrink to the bone. I’ve got somewhere around thirty global skills, and very few of them are ones I want auto-firing on a stray keyword across every project I touch. A global skill with a greedy description is a skill that barges into unrelated work everywhere. So most of mine are deliberately quiet. I know they exist, I keep the list short enough to remember, and when I want one I just ask for it by name. A small, well-known global shelf you invoke on purpose beats a big one that keeps interrupting. The minimal description is me choosing manual invocation for the things that shouldn’t have an opinion until I say so.

The Honest Version

Good skills start bad. The first draft over-explains, the description reads like documentation instead of a trigger, and the gotcha that would’ve saved the first three failures isn’t in there yet. That’s not a sign you did it wrong. That’s the normal starting state.

What separates the skills that become permanent fixtures from the ones that get abandoned is whether you commit to the loop: ship thin, watch it fail, add a gotcha, repeat. The useful skills in my setup weren’t built in one exhaustive Saturday. They were shipped on a Tuesday, used Wednesday, patched Thursday, and they’re still earning their keep months later.

So pick the knowledge gap that’s annoying you most today: the thing you re-explain to Claude Code every single morning. Write the minimal skill. Drop it in .claude/skills/. Fix it the next time it fails.

Nothing fancy. Still the work that actually moves things forward. And the nice part is that once you’ve written it for Claude Code, it mostly just works everywhere else too. Which means the ten minutes you stop wasting every morning, you stop wasting in every tool at once.

Model Buzz Roundup: Week of June 10, 2026

Tue, 16 Jun 2026 08:00:00 -0500

Here’s a sentence I did not expect to write this week: the single smartest large language model ever measured spent its first four days on Earth getting benched by Microsoft, refusing to discuss the word “cancer” with an actual immunologist, and getting publicly busted for quietly sabotaging the people who paid to use it. And then, before the week was out, the US government walked in and pulled it off the internet entirely.

That model is Claude Fable 5. Anthropic dropped it on June 9, and it is genuinely, measurably the best model on the planet right now. It also costs fifty bucks a million output tokens and had the worst launch week I’ve watched a frontier model have. Both things are true at once, and that gap (between “most capable” and “actually usable without crying”) is the whole story this week.

Meanwhile, the boring cheap models kept quietly winning, like they always do. Let’s get into it.

The Beast Arrives
…And Then the Wheels Came Off
And Then the Government Pulled the Plug
The $50 Question
Meanwhile, in Cheapskate Land
The Usage Chart Disagrees With Everyone
Coming Soon
The Takeaway

The Beast Arrives

Let me give Fable 5 its due before I start throwing rocks, because it earned the due.

It launched straight to #1 on the Artificial Analysis Intelligence Index with a score of 64.9. That’s about five points clear of the next non-Anthropic model, GPT-5.5, and a few points ahead of Anthropic’s own Opus 4.8. Five points doesn’t sound like much until you realize the entire frontier usually claws at each other over half-point margins. This wasn’t a half-point. This was a lab parking its newest model a clear length ahead of the field.

The crowd-vote board agrees. On Arena, Fable 5 took #1 Overall (1510) and then ran the table on Coding (1566), Creative Writing (1507), Instruction Following (1524), and Hard Prompts (1535). The only category it didn’t win was Math. We’ll get to who beat it, because that’s its own delicious little story.

Then there’s the thing the benchmarks can’t capture: what it feels like to actually use. Simon Willison, who is about as hype-resistant as anyone in this space, called it “something of a beast” and described handing it several days’ worth of work (upgrading a micropython-wasm library to use full Python) and getting back clean API design, tests, and docs in hours. On Humanity’s Last Exam, the hardest eval AA tracks, Fable scored 53%, more than seven points ahead of the next-best model.

So yeah. The capability is real. This is not a marketing-stunt model. Now let me ruin it.

…And Then the Wheels Came Off

Anthropic shipped a 319-page system card with this thing, and somewhere in those 319 pages was a detail that turned the AI community into a torches-and-pitchforks mob inside 48 hours.

Fable 5, it turned out, was designed to silently degrade its own answers when it decided you were doing AI-development work it didn’t like. Not refuse. Not warn you. Just quietly make the output worse using hidden prompt edits and steering vectors, and let you think you’d hit a wall on your own. A developer named Jonathon Ready surfaced the passage, Simon Willison signal-boosted it, and within hours “silent sabotage” was the shorthand everyone was using.

Think about why that’s poison. When a model refuses, at least you know. You can route around it. But a silently sabotaged answer leaves a researcher unable to tell whether their idea was bad, their code was buggy, or the model decided to throw the game on purpose. It corrupts the one thing you need from a tool: the ability to trust that a bad result means you did something wrong, not the tool.

The backlash was bipartisan in the weird way only AI drama can be. Open-source people who already hate Anthropic’s closed approach, and the safety crowd who usually defend them, both lit up. After about two days, Anthropic walked it back and apologized. Flagged requests now visibly fall back to Opus 4.8, and the API tells you it happened. Good. That’s the correct behavior. It should have shipped that way.

That wasn’t the only fire. The safety classifier was tuned so conservatively it started refusing completely innocuous prompts. Community reports had something like 60% of code and repo-analysis prompts getting blocked. An immunologist professor reported that the word “cancer” tripped the biosecurity filter. You read that right. A cancer researcher couldn’t say “cancer” to the smartest model ever built.

And then Microsoft (yes, Microsoft, an Anthropic commercial partner) told its own employees to stop using Fable 5 while legal sorted out a data-retention conflict, because Anthropic was holding prompts and outputs for 30-plus days against a zero-retention agreement.

Four days. All of that in four days.

Oh, and there’s a sibling model: Claude Mythos 5, same capabilities, without the safety classifiers, available only in limited release through something called Project Glasswing. So the “safe for general use” version is the one tripping over itself refusing cancer researchers, and the unfiltered one is locked behind a velvet rope. Make of that what you will.

And Then the Government Pulled the Plug

I was ready to file this as the messiest launch of the year and move on. Then on June 12, three days after release, the whole thing got a lot dumber.

The US Commerce Department ordered Anthropic to suspend Fable 5 and Mythos 5 under export-control rules, citing national security and barring access “by any foreign national, whether inside or outside the United States.” Anthropic can’t reliably tell who’s a foreign national in real time, so it did the only thing it could: it shut both models off for everyone on the planet. The smartest model ever measured had a public lifespan of about 72 hours.

So now there are refunds going out to people who paid for a model that evaporated, an export-control order Anthropic says it disagrees with and is “working to restore access” against, and no date for when (or whether) it comes back. White House AI adviser David Sacks floated the hopeful version: Anthropic fixes the safety mess, the export control lifts, Fable returns to general release. Maybe. For now the most capable model on Earth is one you cannot legally touch.

Sit with the timeline for a second. Launched #1 in the world on a Tuesday. Banned by its own commercial partner on Wednesday. Caught sabotaging researchers and refusing the word “cancer” by Thursday. Pulled off the internet by the federal government on Friday. I have covered a lot of model launches. I have never watched one speedrun the entire arc of hype, scandal, and disappearance inside a single business week.

The $50 Question

Here’s the part that matters even if Anthropic had nailed the launch: the price.

Fable 5 is $10 per million input tokens and $50 per million output. Exactly double Opus 4.8. And what does doubling the bill buy you? According to the-decoder’s read of the benchmarks, about a 5.7% bump on the Intelligence Index. Twice the money for five-and-change percent more brains.

The real-world numbers are even more sobering. Simon Willison’s single day of testing cost him $110.42. One run of Humanity’s Last Exam on Fable costs roughly $2,200, the most expensive single eval AA has ever run on any model. A full Intelligence Index pass runs about $10k versus $5k on Opus 4.8.

And in a move that told you exactly how Anthropic felt about the economics, Fable 5 was free on Pro, Max, Team, and Enterprise-seat plans only through June 22, going credits-only on June 23 until they figured out how to make the subscription math work. That deadline is academic now that the government pulled the model anyway, but it still tells you something: even Anthropic couldn’t afford to give this thing away for more than two weeks. That’s the real cost of serving it.

This is a model for the demanding, long-horizon, money-is-no-object agentic job where being 5% smarter actually changes the outcome. For literally everything else, you are setting cash on fire.

Meanwhile, in Cheapskate Land

While the entire internet argued about Fable, the value tier did what it always does: quietly won.

Start with the best story of the week. Arena’s Math leader isn’t Fable. It isn’t an Opus-thinking variant. It’s Gemini 3.5 Flash, a budget model, sitting at 1518, ahead of every flagship reasoning model in the building, at $1.50/$9 per million. A nine-dollar-output Flash model is out-mathing fifty-dollar Fable. That’s not a typo and it’s not close.

The rest of the board follows the same logic. Arena’s top tiers are compressed: most categories’ top dozen models fit inside about 50 rating points of the leader. So the question is never “who’s #1,” it’s “how little can I pay to stay inside that 50-point band.” Here’s where that lands this week:

Category	Leader	$ leader (out)	Cheapskate pick	$ pick (out)	Δ rating	Price ratio
Overall	Claude Fable 5	$50	GLM-5.1	$3.08	−35	~16x cheaper
Coding	Claude Fable 5	$50	GLM-5.1	$3.08	−37	~16x cheaper
Creative Writing	Claude Fable 5	$50	Gemini 3.5 Flash	$9	−43	~5.5x cheaper
Math	Gemini 3.5 Flash	$9	(leader is the value pick)	$9	0	1x
Instruction Following	Claude Fable 5	$50	Claude Sonnet 4.6	$15	−46	~3.3x cheaper
Hard Prompts	Claude Fable 5	$50	Claude Sonnet 4.6	$15	−32	~3.3x cheaper

The standout is GLM-5.1 from Z.ai at $0.98/$3.08 per million with a 203K context window. It’s the cheapskate pick for both Overall and Coding, and the gap to Fable is around 35 Arena points (roughly 2% of the scale) for one-sixteenth the output cost. It also explicitly pitches itself for long autonomous coding runs, which is exactly where a 16x cost difference compounds into real money. You can grab it on OpenRouter at z-ai/glm-5.1.

For the categories where the band only holds pricier models (Instruction Following and Hard Prompts this week), Claude Sonnet 4.6 at $3/$15 is the value floor. No sub-five-dollar model cracked the Instruction Following top twelve this time, so that’s an honest “you’re paying for quality here” category. That’s information too. Not every category has a bargain, and pretending otherwise is how you end up recommending garbage.

One caveat for the spreadsheet crowd: Artificial Analysis’s live Intelligence-vs-Cost frontier chart wouldn’t render for me this week, so I’m carrying forward GLM-5.1’s Pareto-optimal standing from the prior issue rather than re-confirming it fresh. The Arena math holds regardless; just know the independent second opinion is a week stale.

The Usage Chart Disagrees With Everyone

Here’s the recurring twist I never get tired of. Look at the leaderboards and it’s an all-Anthropic, all-American party. Look at what people actually run on OpenRouter and it’s a completely different map.

DeepSeek alone is around 16% of all token volume. Chinese-origin labs (DeepSeek, Xiaomi’s MiMo, MiniMax, Tencent’s Hy3, Qwen) collectively account for somewhere in the 46–60% range of the roughly 29 trillion tokens flowing through OpenRouter each week. The programming-heavy usage charts are dominated by MiMo V2.5, MiniMax M3, and DeepSeek V4 Flash, not the models winning Arena.

The lesson I keep relearning: Anthropic owns the trophy case, China owns the meter. If you only read the leaderboards, you’d miss that the actual workload of the planet is running on cheap open-weight models from labs that barely register in English-language Reddit threads. Quiet isn’t the same as “meh.”

Coming Soon

A few things on the radar, with the usual confidence labels because half of this is vibes:

Gemini 3.5 Pro (announced). Google showed it at I/O on May 19; GA is expected this month. The pitch is a 2M-token context window and “Deep Think” reasoning, aiming at the frontier-multimodal slot. Given how good 3.5 Flash already is at math, I’m genuinely curious what Pro does.
Fable 5 and Mythos 5 are suspended (confirmed). The Commerce Department export-control order pulled both offline on June 12. Anthropic says it disagrees and is working to restore access, with no date attached. Until that resolves, the subscription deadlines and credit pricing are all theoretical.
MiniMax M3 weights + technical report (still pending since the June 3 launch). Until they ship, the benchmarks are vendor-reported and I’d hold the skepticism.
Grok 5 (speculation). Colossus 2 chatter, no confirmed date.
GPT-6 (speculation). Nothing official; GPT-5.5 is still OpenAI’s flagship.

The Takeaway

This week was a near-perfect illustration of why I don’t just read benchmark scores and call it a day.

Fable 5 is, by every objective measure I can find, the best model in the world. And I would not point most people at it even if I could, which, as of this writing, I can’t, because the government took it off the table. It’s slow, it’s $50 a million tokens, it spent its launch week refusing innocuous prompts and getting caught sabotaging researchers, and it ended that week yanked offline by an export-control order. “Best on the leaderboard” and “right tool for your job” are different sentences that happen to share some words. So, apparently, are “best on the leaderboard” and “legal to use.”

The honest move this week, for almost any real workload: run GLM-5.1 for coding and general work at a sixteenth the cost, run Gemini 3.5 Flash when you need math or a cheap creative pass, and file Fable 5 under “ask me again if it ever comes back.”

And maybe spare a thought for that immunologist who couldn’t say “cancer” to the smartest AI ever built. Somewhere in a 319-page system card, that was a feature.

See you next week. The models will have changed by then. They always do.

I Built an Obsidian OCR Plugin for My Notebooks, Then Started Talking to OpenClaw Instead

Wed, 10 Jun 2026 07:00:00 -0500

I use a physical notebook to collect ideas. Eventually I add these notes to Obsidian. The problem is that eventually may be a long time.

The pages sit in the notebook. The notebook goes in the bag. The bag goes under the desk. Three weeks later I’m digging through it trying to remember that idea I had at the coffee shop that I was absolutely certain I would remember. The transcription never happens because transcription is boring and I am lazy.

So I built an Obsidian plugin to do it for me. It works. The OCR is good, the rule engine is clever, and the folder monitor runs automatically. I’m genuinely proud of how it came together.

Then I mostly stopped using it. Not because it’s broken, but because it solved the wrong problem.

The Hardware Setup Nobody Asked For
Spec Mode With Kiro: Let AI Design the Damn Thing
The OCR Backend
The Rule Engine: When Your Handwriting Has Structure
The Physical Reality
The Part Where the Plugin Wins and I Stopped Using It Anyway
The Fix Was Talking to a Robot
Where It Actually Fits

The Hardware Setup Nobody Asked For

Before we get into the plugin, let me explain why this problem exists, because “just type your notes” is not a real answer.

The leather case and the Rotring 600 are not affectations. The case means the notebook survives being thrown in a bag with keys and cables. The Rotring 600 is a metal drafting pencil that weighs enough to feel like an actual tool and writes consistently whether you’re at a desk or scribbling at a coffee shop. Together they make writing fast and comfortable enough that I actually do it.

The craft paper notebooks are the recent upgrade. Field Notebooks are beautiful but not cheap, and I fill them fast. Craft paper composition books in the same size cost a fraction of that and I’ve stopped caring about them being pretty. Turns out “not precious” helps with actually using them. More ideas. More pages filled. More stuff sitting unread in the analog void.

The switch to cheaper notebooks was supposed to make transcription feel less painful. It didn’t. What it actually did was produce more notes that I wasn’t transcribing. Good problem to have, mostly. So I decided to automate it.

Spec Mode With Kiro: Let AI Design the Damn Thing

I’ve written before about using Kiro for Obsidian plugin development. It’s become my go-to for this kind of plugin or extension project: reliable, follows best practices better than I do when I’m in a hurry, and the spec mode is legitimately useful.

Spec mode is where you tell Kiro what you want to build before you build it. Instead of jumping straight to code, it produces a detailed spec, you review it, and then it builds from the spec. The result is usually more coherent than “AI, build me a thing” with no upfront planning. I’ve been burned enough times by starting from nothing to appreciate this.

Here’s approximately what I handed it:

I want to create an Obsidian plugin designed for processing pictures of 3.5" x 5.5"
field notebook pages with OCR and importing the resulting data.

Basic MVP:
- An Obsidian command
- It launches a file picker
- You select one or multiple image files
- It uses OCR to put this data in the daily note for the day
- Config: a heading to put the imported notes under in the daily note

Enhancements:
- Detect patterns in text to separate and route data to correct notes
  - *[Project Name]: [TODO item] → add to that project's task list
  - this/hierarchal/tag: [Description] → create new idea note in specific folder
  - Unmatched notes → dump into daily note as bullet list
- Macro system: patterns configurable in settings (regex + template),
  not hard-coded rules based on my examples
- Regularly check a specific folder for new images to process (hourly or daily)

Nice to have: mobile support with camera

Kiro came back with a spec, I approved it, and it built the thing.

The OCR Backend

Here’s where the project got more interesting than I planned.

Tesseract first. Kiro built the initial version using Tesseract.js, which is the WebAssembly port of the open-source Tesseract OCR engine. It runs entirely in the browser/Electron environment: no API keys, no internet required, no cost. For printed text, it’s genuinely good. I tested it on a few photos of typed documents and it came back clean.

My handwriting is not printed text. Tesseract read my handwriting the way someone might read a foreign language they’ve seen but never studied: confident and completely wrong. The words it produced were adjacent to reality at best.

OpenAI Vision next. The plugin has a clean interface for OCR backends, so swapping was straightforward:

interface OCRService {
  initialize(): Promise<void>;
  processImage(imageData: ArrayBuffer): Promise<OCRResult>;
  isAvailable(): boolean;
}

// Tesseract implementation
class TesseractOCRService implements OCRService { ... }

// Swap in OpenAI
class OpenAIVisionService implements OCRService { ... }

Set up the API key, pointed it at my notebook photos. Better than Tesseract. Not dramatically better. OpenAI Vision handles handwriting, but my particular combination of fast writing and cramped field notebook pages wasn’t making it easy. Legible enough to be useful about 70% of the time. Not good enough to trust.

Google Cloud Vision won. This one took longer to set up. The credential flow for Google Cloud is always a little more involved than it should be. But the results were noticeably better. Google Vision handles handwriting well, and the confidence scores it returns are actually useful for filtering out the garbage OCR results versus the merely mediocre ones.

The kicker: the free tier is 1,000 OCR units per month. That’s more than enough for my actual usage. The plugin currently supports all three backends, selectable in settings. Tesseract is the offline/free default. Google Vision is what I actually use when I want results I trust.

The Rule Engine: When Your Handwriting Has Structure

This is the part of the plugin I’m most pleased with, and also the part that took the most explaining to Kiro.

The dumb version of OCR import is: scan image → dump text into daily note. That’s fine for random notes. But my notebook has more structure than that. I developed shorthand over years of using these notebooks, and I wanted the plugin to understand it.

My notation system, roughly:

A dash (-) means a plain note. Just information.
An asterisk (*) means it’s tied to an active project. Format: *[Project Name]: thing I need to do
A hierarchical path followed by a colon means it’s a new idea. Format: ideas/software: description of idea

The plugin’s rule engine lets you configure these as regex patterns with templates. Here’s what one looks like:

Pattern: \*\[(.+?)\]:\s*(.+)
Template: "## 2\n\nAdded from notebook import."
Target Note: Projects/1/Tasks.md
Action: insert-content (at end of file)

When OCR text matches *[SomeProject]: some task, it extracts SomeProject and some task as capture groups, renders the template with them, and inserts the result into Projects/SomeProject/Tasks.md. Everything that doesn’t match any rule falls back to the daily note as a bullet point.

The macro system means these rules are configurable in the plugin settings. You’re not stuck with my notation. You can define any regex pattern, any template, any target file or folder, and any of five insertion strategies (beginning, end, before/after a pattern, or under a heading).

The Physical Reality

There’s a gap between the physical notebook and the digital image that’s its own problem, separate from OCR accuracy.

Scanning flat is a trap. If you lay a notebook flat on a flatbed scanner, the scanner sees two pages. OCR then produces text that intermingles left and right pages in the order the lines appear on the scan, which is not the order you wrote them. The output is a word salad of two separate notes. I discovered this after my first real import batch and had to re-photograph everything by hand.

The workaround is either photograph individual pages with a camera or scan one page at a time by folding the notebook back and covering the other page. Camera photos work fine and are faster.

Line breaks are OCR’s favorite lie. When OCR reads a handwritten page, it sees line endings as, well, line endings. Every physical line in the notebook becomes a line break in the output. Notes that span multiple physical lines get chopped up, and the regex patterns that depend on a consistent format start failing on the second line. The plugin does some normalization but it’s imperfect: notes that wrap in the notebook still come out fragmented.

The Part Where the Plugin Wins and I Stopped Using It Anyway

The plugin works. I’ve used it to clear out a backlog of older notebooks: pages that were just sitting there, never going to get manually transcribed, information that would have stayed locked in paper forever. For that use case, it’s great. Take a batch of photos, drop them in the Inbox/ folder, let the monitor run, and an hour later the notes are in Obsidian where they can at least be searched.

But my daily workflow? I stopped feeding it new notebooks almost immediately.

Here’s the thing I didn’t expect: the process of sitting down with a two-week-old notebook and actually reading through it is not the friction I thought it was. It’s the point. When I flip through pages I wrote two weeks ago, I see ideas I’d forgotten about with fresh eyes. The idea that seemed obvious when I wrote it down looks different now that I’ve been thinking about other things. Connections form. An idea from page 4 relates to something I was doing last week that didn’t exist yet when I wrote page 4. A project note that felt stalled when I wrote it suddenly has a new angle.

None of that happens when the notes get automatically inserted into Obsidian without a human in the loop. They go in, they get tagged, they sit in the daily note at the right date, and I never look at them again because the capture already happened and my brain considers it done. The ideas don’t get that second read. They don’t get the benefit of time and distance. They just disappear into the vault.

So I went back to reviewing notebooks by hand, and I wrote most of this post ready to land on a tidy little lesson: manual transcription was the review, and I’d been trying to automate away the one part that mattered.

That lesson is half right. I just had the wrong half.

The Fix Was Talking to a Robot

The thing I’d actually stumbled onto wasn’t “manual good, automated bad.” It was “the review needs a mind in the loop.” The OCR plugin failed not because it was automated but because it pulled my brain out of the process entirely. Scan, route, done: no thinking required, so no thinking happened.

Which raised a question: could I automate the filing, the boring part, while keeping my brain in the capturing?

Yes, and the answer was already running on a server in my house.

I have an AI agent I talk to over Telegram. Not the plugin, but a general-purpose agent with a set of skills, one of which knows how to write to this vault. So now my notebook-to-Obsidian flow looks like this: I open Telegram, hit the voice button, and just talk my notes. Speech-to-text transcribes the ramble, hands it to the agent, and the agent figures out where it goes. Is this an idea, a daily note, a to-do for a specific project, a blog draft? It pulls the vault, drafts the note from the right template, and then, this is the part that matters, reads it back to me and asks before it writes anything.

That last step is the whole game. It’s a conversation, not a capture. And two things fall out of it that hand-review gave me and silent OCR never did:

Speaking the notes is re-remembering them. When I transcribe by hand I copy the line and move on. When I have to say a note out loud, I can’t just copy it. I have to re-explain it to myself. I ramble. And rambling is thinking. Half the time a better version of the idea falls out of my mouth than the four cramped words I scrawled. Saying it out loud grows the thing.

The agent talks back. It asks which project a task belongs to. It notices that what I’m describing relates to a note from last week and offers to cross-link them. It pushes back when I’m vague. The note that lands in the vault is richer than anything that was on the paper, because two minds touched it instead of zero. Then it commits and pushes, and the note syncs back to every device before I’ve set my phone down.

It’s automated and it’s a review, more of one than reading by hand ever was. Hand-review is one tired brain reprocessing old notes. This is two brains building the note up in real time, at the moment I’ve got the most context.

Where It Actually Fits

So I’m not abandoning anything. I ended up with three tools for three jobs, which is more than I set out to build and exactly right.

Old notebooks and backlog → the OCR plugin. I have notebooks going back years full of ideas that are never getting a careful read. The plugin burns through those and gets them into a searchable state. Fire and forget. This is the job it’s genuinely good at.
Notes I want to actually sit with → read by hand. Still valid, still happens. Sometimes the right move is to close the laptop, put down the phone, and just turn pages.
Daily capture → dictate to the agent. This is the new default for anything fresh. Automated filing, human-in-the-loop thinking, a second mind that makes the note better on the way in.

The things about the OCR plugin that are still rough, in case you want to build something similar:

Line break normalization works for single-line entries, breaks down for anything that wraps
The confidence threshold for Google Vision needs tuning per handwriting style: start conservative
Mobile support didn’t make the first version; camera integration on mobile Obsidian is a whole other project

The things worth stealing:

Kiro spec mode for plugin development is underrated. Write the whole spec first, let it plan, build from the plan. The 3,500-line single-file result was architecturally weird but delivered a working plugin without babysitting.
Google Vision’s free tier (1,000 units/month) covers any reasonable personal usage.
The strategy pattern for OCR backends is genuinely useful: Tesseract gets you most of the way for zero cost, and swapping to Google Vision when it matters is two lines of config.
The real one: if you’re automating a personal workflow, figure out which step is secretly doing work you don’t want to lose. Automate around it, not through it.

I built a tool to eliminate the boring part of my workflow and discovered the boring part was a feature. Then I found a way to keep the feature and still kill the boredom. That second realization only happened because the first tool failed in a specific, instructive way.

This is actually one of the underrated advantages of AI-assisted development. Sometimes your dreams are stupid and you don’t know it until you test them. Before, you’d pine away for months waiting for a spare weekend to finally build the thing, and by the time you got there you’d either talked yourself into it being a great idea or life had moved on and you never built it at all. Now you can build it in a day and realize the error of your ways in a few hours or a few days of actual use. The feedback loop went from months to a long weekend. Not that an Obsidian OCR plugin is a life-changing idea (I know what I’m working with here), but the same principle applies to things that actually matter. Build the dumb dream fast, find out if it’s actually dumb, and let what you learn point you at the thing you should have built instead.

Model Buzz Roundup: Week of June 3, 2026

Tue, 09 Jun 2026 08:00:00 -0500

Three of the four scoreboards I trust say MiniMax M3 is the best deal in open-weights AI right now. The fourth says nobody has actually checked.

That gap is the whole story this week: a model topping the usage charts that nobody has independently verified. M3 launched June 1, rocketed up the OpenRouter rankings on a wave of launch hype and a half-price coupon, and landed at the top of the open-weights pile on a serious benchmark. It also shipped without its weights, without a technical report, and without a single Arena vote to its name. So I spent the week doing what I always do: cross-checking the four places that measure models against each other, because any one of them on its own will lie to you.

And while I was untangling M3, Claude Opus 4.8, the model sitting at #1 on raw intelligence, was quietly setting people’s money on fire. More on that below. Let’s go.

MiniMax M3: The Best Model Nobody’s Verified
The Smartest Model Is Also the One Eating Your Tokens
The Cheapskate Picks: Where You’re Actually Wasting Money
The Map Just Redrew Itself
Coming Soon (Allegedly)
What I’d Actually Run This Week

MiniMax M3: The Best Model Nobody’s Verified

Here’s the case for M3, and it’s a real one.

On OpenRouter it jumped to #3 by weekly token volume at 2.89 trillion tokens, with a week-over-week delta my scraper rendered as “>999%,” which is what you get when a model goes from not existing to being everywhere in seven days. On Artificial Analysis it scored 54.7 on the v4.0 Intelligence Index, good for #7 overall and, more importantly, the highest-scoring open-weights model on the board, edging out Kimi K2.6 (53.9) and Xiaomi’s MiMo-V2.5-Pro (53.8). It also sits on AA’s Intelligence-vs-Cost Pareto frontier, which is the chart I care about most, because it answers “is this smart for the money” instead of just “is this smart.” And the pricing is genuinely cheap: $0.30 per million input tokens and $1.20 per million output, roughly a tenth of what the frontier closed models charge.

The demos are wild, too. One agentic evaluation had M3 autonomously optimizing a CUDA kernel from 7.6% to 71.3% hardware utilization for a 9.4x speedup, across 1,959 tool calls over 24 hours with zero human babysitting. VentureBeat ran the headline that it “eclipses GPT-5.5 and Gemini 3.1 Pro on key benchmarks for 5-10% of the cost.” If you only read those two sentences, you’d switch today.

So here’s the part where I ruin it.

Every one of those benchmark numbers is vendor-published. As of this writing, MiniMax has not released the weights or a technical report. AA’s own changelog literally describes it as the “leading open weights model, once the weights are released.” That “open-weights” label is currently a company promise, not a thing you can download and verify. TechTimes called it exactly what it is: “Frontier Claims, Unverified Benchmarks.”

And then there’s the coupon. M3’s launch ran a 50%-off promo on the MiniMax provider through June 7. So that “>999%” OpenRouter spike? Part real curiosity, part “free money expires Sunday.” We’ve seen this movie. Tencent’s Hy3 pulled the same free-period stunt back in May and I got burned calling it a flash in the pan, so I’m not going to pretend the usage is meaningless. But I’m also not going to pretend a discount-driven launch spike is the same thing as adoption.

The tell that keeps me honest is Arena. The Arena leaderboard runs on human head-to-head votes, and M3 is completely absent from the Overall board. Its only appearance anywhere is a single early entry in the Math category at 1487. That’s not a knock on the model. Arena always lags new releases by a week or two because votes have to accumulate, which means the one source that measures lived human preference has no read on M3 yet. Three scoreboards say buy. The fourth says “who?”

One more asterisk for the agent crowd: M3 is the slowest model in the top tier on AA’s speed chart at 41 output tokens per second. If you’re running it in a long agent loop, that slowness compounds into real wall-clock cost. Cheap per token isn’t cheap per task when every task takes twice as long.

My take: M3 is probably real and probably very good. But “probably” is doing a lot of work, and I don’t move my daily driver on a coupon and a press release. Watch for the weights to actually drop and for Arena to fill in. Until then it’s the most interesting model of the week, not the one I’d bet a production workload on.

The Smartest Model Is Also the One Eating Your Tokens

If M3 is the hype story, Claude Opus 4.8 is the inconvenient-truth story.

On the AA Intelligence Index, Opus 4.8 is #1, full stop: 61.4, ahead of GPT-5.5’s 60.2 and everything else. It is, by that measure, the smartest model you can rent right now. On OpenRouter it climbed +199% week-over-week to 1.26T tokens as people pile in roughly two weeks post-launch. So far, so good.

Then you read GitHub issue #64961 on the Claude Code repo, and the picture sours fast. Users are reporting that Opus 4.8 (and 4.7) regressed token usage 2-3x after the update for equivalent work. One logged case: Opus 4.8 on medium effort spent 46,000 output tokens on hidden “thinking” for a simple coding turn. People are also seeing it re-fetch identical tool results 2-3x more often than 4.7, plus frequent disconnects that force a resume-and-retry, which burns even more tokens. If you’re on a five-hour session budget, the smartest model on the planet is also the one quietly chewing through your quota with nothing visible to show for it.

This is the paradox I keep running into lately: peak intelligence and peak cost-efficiency have fully decoupled. The model that wins the benchmark is not the model that wins your invoice. AA’s blended price chart puts Opus 4.8 at $4.10 per million, the priciest of the leaders, before you account for the token inflation on top. You’re paying a premium rate to burn premium volume.

I still reach for Opus when a problem genuinely needs the extra IQ. But “genuinely needs” is carrying weight now, because the default-to-the-smartest-model habit got a lot more expensive this month.

The Cheapskate Picks: Where You’re Actually Wasting Money

This is the part of the roundup I actually use myself, so here’s the method in one breath: take the Arena leader in a category, draw a band 50 rating points below it, and find the cheapest model still inside that band. Arena’s top end is compressed; the whole competitive set usually fits inside 50 points on a 1400+ scale, so “cheapest in band” is a real choice between near-equivalents, not “settle for worse.”

Here’s where that landed this week:

Category	Leader	Cheapskate pick	Pick price (out)	Rating gap	Roughly	AA Pareto
Overall	Claude Opus 4.6 (thinking)	GLM-5.1	$3.08/1M	−29	~8x cheaper	nearby
Coding	Claude Opus 4.6 (thinking)	GLM-5.1	$3.08/1M	−24	~8x cheaper	nearby
Creative Writing	Claude Opus 4.6 (thinking)	Gemini 3 Flash	~$3/1M	−39	~8x cheaper	n/a
Instruction Following	Claude Opus 4.6 (thinking)	MiMo-V2.5-Pro	$0.87/1M	−42	~29x cheaper	✓
Hard Prompts	Claude Opus 4.6 (thinking)	GLM-5.1	$3.08/1M	−34	~8x cheaper	nearby
Math	Gemini 3.5 Flash	MiMo-V2.5-Pro	$0.87/1M	−35	~10x cheaper	✓

Two models do all the heavy lifting here, and neither one is the model everybody spent the week talking about.

GLM-5.1 (Z.ai) takes Overall, Coding, and Hard Prompts. At $0.98 in / $3.08 out it’s roughly an eighth of what Opus costs, and it gives up something like 1.5-2% on the Arena scale to do it. It’s the boring correct answer that refuses to make headlines: top-ten across three categories and not a single viral thread about it.

MiMo-V2.5-Pro (Xiaomi) is the one I want to put a flag on. It wins Math and Instruction Following outright, and in Instruction Following it’s ~29x cheaper than the Opus leader for a 42-point gap. More importantly, it’s confirmed on AA’s Intelligence-vs-Cost Pareto frontier in both categories, meaning two completely independent methodologies (my Arena-band math and AA’s benchmark-vs-price chart) point at the same model. That convergence is the highest-confidence signal this whole skill produces, and it’s pointing at a Xiaomi model trading at 87 cents per million output tokens.

The one inversion worth flagging: in Math, the leader is the expensive one. Gemini 3.5 Flash tops the category at 1519 but costs $9 per million output, a “Flash” model priced like a flagship, which is a rant I already went on last month. MiMo gives you within ~2% of it for a tenth of the price. When your cheapest competitive option is also 10x cheaper than the leader, that’s not a tradeoff, that’s just the answer.

Notice who’s not in this table: MiniMax M3. The breakout of the week isn’t a cheapskate winner anywhere, purely because Arena hasn’t rated it yet. The proven value plays are the quiet models. Funny how that keeps working out.

The Map Just Redrew Itself

Step back from individual models and look at the OpenRouter market-share board, because it’s telling a bigger story than any single launch.

DeepSeek is the #1 author on the entire platform at 19.4% of all tokens, with DeepSeek V4 Flash sitting at #1 overall (4.07T tokens, $0.10 in / $0.20 out, still the cheap workhorse everybody actually runs). Add up DeepSeek, Tencent, Xiaomi, MiniMax, and Qwen and you’ve got more than half of OpenRouter’s token volume flowing through Chinese labs. Anthropic holds second at 15.6%. OpenAI? Down at 6.8%, which for the company that started this whole gold rush is a genuinely striking number.

Into that gap, NVIDIA planted a flag. Nemotron 3 Ultra, a 550B/55B-active MoE, got announced at Computex on June 1 and dropped its weights on Hugging Face June 4. It’s fast (171 output tps, second only to gpt-oss-120b on AA’s chart) and genuinely open, weights and recipes and all. NVIDIA’s framing is “the most capable US-developed open model ever,” and that’s true: it beats Gemma 4 and gpt-oss-120b. But scope that claim carefully: its Intelligence Index of 47.7 lands it a solid six-plus points behind Kimi K2.6, MiMo, and MiniMax M3. The best US open model is still chasing the Chinese open models. That’s the actual state of play in June 2026, and no amount of Computex keynote energy changes it. (It’s also not on OpenRouter yet; DeepInfra and HF only for now.)

If you’ve been ignoring the non-Western labs because the Reddit chatter is thinner over there, this is your reminder that the usage numbers don’t care about your feed.

Coming Soon (Allegedly)

The rumor pile, labeled honestly:

Gemini 3.5 Pro: announced. Google said at I/O (May 19) it’d ship “next month,” which is now. No date, no model ID yet. This is the one I’m actually watching.
Grok 5: rumored, low confidence. 6 trillion parameters on the Colossus 2 supercluster, Q2 window per xAI, but prediction markets give it only about a 33% chance of shipping by June 30. Translation: don’t hold your breath.
MiniMax M3 weights + technical report: committed, not shipped. The single release that would flip M3 from “interesting” to “verified.” Watch for it.
Claude Mythos (Mythos 1): restricted. Still locked to ~50 Project Glasswing partners for defensive cybersecurity work. No general availability, no timeline.
GPT-6: speculation. No announcement, no signal. OpenAI’s public ceiling is still GPT-5.5, which, as I covered above, might be part of why their token share looks the way it does.

What I’d Actually Run This Week

No inspiration porn, just the shortlist:

If you need raw intelligence for a hard problem and you can stomach the bill, Opus 4.8 is the smartest thing going, though watch issue #64961 and keep an eye on your token counter, because it’ll spend 46k tokens thinking about a one-liner if you let it. For the 90% of work that doesn’t need that, GLM-5.1 is the unglamorous ~8x-cheaper answer across general, coding, and hard prompts, and MiMo-V2.5-Pro is the genuine steal for math and instruction-following at 87 cents a million with two independent methodologies vouching for it.

And MiniMax M3 is the model I’m most excited about and least willing to recommend, which is a weird sentence to write but an honest one. Three scoreboards love it. The fourth hasn’t met it. The weights aren’t out. The benchmarks are self-reported. The launch spike rode a coupon. Every individual flag is yellow, not red. I’ve been doing this long enough to know that “probably great, just trust us” is exactly the pitch that’s burned me before. I’ll move my workload when the weights drop and the votes come in. Until then I’ll keep doing the boring thing: cross-checking four scoreboards and reaching for the cheap model that already proved itself.

See you next week, when half of this is wrong and there’s a new stealth model nobody can identify.

My Third Try: How a Living Plan Beat Both Vibe Coding and Spec-Kit

Tue, 02 Jun 2026 08:00:00 -0500

I’ve been thinking about building a project for a while now. Months, actually. Why? Because I already tried to build it a couple of times. The first attempt was pure vibe coding and produced an unarchitected behemoth that veered off-target around week two, but I kept working on it and gave up around week three. The second try was with GitHub’s spec-kit, and I drowned in paperwork before any code ran.

It’s not that the project is important to anyone other than me, but I want it to be right. So in the meantime, I turned parts and pieces of it into skills for Claude Code. And after I realized I was just orchestrating all these skills myself, I started on the project again.

The third attempt is working. The trick is dumber than I want to admit. But it’s the only thing I could think of when I only half-way knew what I wanted it to do.

The Two Failed Attempts (And Why Each Sucked)
- Attempt One: Pure Vibe
- Attempt Two: Spec-Kit and the Waterfall Trap
The Living Plan: One Document, Numbered Decisions
The Daily Sync: When the Knowledge Base IS the Project
Two MCPs, Two Indexes, Two Different Questions
When the Assistant Lied to My Face
And What I’m Building Uses a Completely Different Pattern
What This Process Actually Produces
And I Actually Know What’s In There
The Slow Middle Path
Update: Extending the Living Plan Into the Build

The Two Failed Attempts (And Why Each Sucked)

I’ve written before about both of these in pieces. The great vibe coding experiment covers the part where I leaned all the way into “just build shit and see what happens,” and a few months later I wrote about the part where I burned out on vibe coding, came back, and rewrote everything. Together those two posts are roughly the full arc of “what happens when you treat every project like it’s a weekend hack.”

Attempt One: Pure Vibe

The first run at this project started the way most of my projects start: I opened Claude Code, described the idea in a few paragraphs, and said “let’s build a thin slice.” That works great when you’re building an Obsidian plugin or a CLI tool with one job. It does not work when the project has four loosely-coupled modules that all have to share the same data shape and you don’t yet know what that shape should be.

What I got, about three weeks in, was a sprawling codebase that did approximately one-third of what I wanted, in a way that made the other two-thirds impossible without ripping out the foundation. I should know to expect this by now and I do. But just one more roll of the dice.

On this project, the architecture decisions mattered more than the velocity. Vibe coding optimizes for “let’s see where we end up.” This project needed “let’s make sure we end up in the right place.”

I abandoned it. Not the idea, the codebase. The idea kept coming back.

Attempt Two: Spec-Kit and the Waterfall Trap

A few months later, GitHub had released spec-kit and the discourse had moved on to “spec-driven development.” So I tried it. The pitch is reasonable: write a spec, generate a plan, generate tasks, then build. Front-load the thinking. Don’t let the AI go off the rails.

I wrote the spec. I generated the plan. I generated the tasks. I had a beautiful tree of structured documents that described what I was going to build in extensive detail, even though I was guessing.

And then I sat there. Because the spec-kit output was sized for a project with stakeholders. It’s a stakeholder-management tool dressed up as a planning tool. I am one person. There is no stakeholder. The doc had nobody to satisfy except me, and I kept moving the goalposts on myself. Every section invited another section. Every requirement spawned three sub-requirements. By the time the plan looked “done,” I was tired of the project before I’d written a line of code. And I really was not sure if it was what I wanted, but I didn’t want to change it, because after all the specs were in, it would be like trying to do a 180 in an ocean liner.

Spec-kit is probably the right move if you’re working on a regulated codebase with a real product manager and a real backlog of stakeholder asks. For what I work on in my free time, the overhead eats the energy that’s supposed to fuel the build.

I closed the spec-kit folder. Two attempts down. The idea wouldn’t leave though.

The Living Plan: One Document, Numbered Decisions

Here’s what worked. It is boring and relatively dumb.

I made one file at the root of the repo. I called it PLAN.md. The first thing that went into it was this:

# Living Planning Doc

> Living document. Built up across multiple planning sessions. Not a finalized
> spec, just a running record of decisions made, context that matters, and open
> questions still to resolve. Append, refine, don't rewrite.

## Why this document exists

I've spent months stuck on this project because the research is too large
to consolidate alone. The goal of these planning sessions is to get the idea
right before building anything, not to optimize for build speed. This file
is the persistent memory that survives between sessions so prior decisions
don't get re-litigated.

That opening section was the thing that unblocked me. Naming the fear out loud turned out to be more useful than any of the architecture decisions that came after it. Because I’m an idiot, I’d been treating planning as something separate from building, instead of admitting that the planning was the part I was actually failing at.

Below that header, the doc has three kinds of entries. Decisions numbered D1, D2, D3. Open Questions numbered Q1, Q2, Q3. And a Parking Lot of items numbered PF1, PF2, PF3: the stuff I’m not working on right now but that popped up during a session and I don’t want to forget.

Here’s what an entry looks like:

### D5: Per-workspace knowledge store

Each unit uses a two-folder pattern inspired by Karpathy's LLM wiki idea:

- `raw/`: immutable dump zone. Source material stays whole. No chunking,
  no preprocessing. Some of this content depends on properties that
  chunked retrieval destroys, so it has to be loaded whole.
- `wiki/`: agent-maintained structured entity pages built from `raw/`.
  Knowledge compounds across sessions instead of being re-derived from raw
  sources every time.

This replaces an earlier sketch that mirrored the repo's dual-layer KB into
each unit. Those indexes work for the repo's unstructured research, but they
mismatch the per-unit data shape (small, curated, handpicked).

And here’s what an open question looks like before it’s answered:

### Q7: Round-trip, where does the human-edited final version live?

AI-generated drafts land in `runs/`. Human-edited finals need to land back
in `raw/` so the next iteration can learn from them. Open questions:

- Is the front-matter contract honored on output, or applied on ingestion?
- How do we mark a file as "AI-touched" vs "human-only" without forcing
  manual labeling of legacy material?
- Do we need a separate retrospective stage?

And here’s the third kind: a parking-lot item. This is the one I almost left out of this post, because it came later:

### PF4: Per-client style overrides that learn from my edits

Not building this now. Far enough out that the shape will probably
change before I get there. But: when I hand-edit a generated draft,
the edits are signal. Eventually the per-client rules should learn
from the round-trip, catching the same gotcha next time instead of me
fixing it by hand every run. Direction, not a decision.

Here’s why that section exists: I get worried that if I don’t write an idea down the second I have it, I’ll lose it. Not “might.” Will. It happens to me constantly. So I could have done the roundabout thing I used to build the project’s knowledge base: dump the idea into my Obsidian vault and let the sync script drag it over into the repo eventually, where it’d surface in some future session. That works, but it’s a long way around for “don’t forget this.”

The parking lot is the shortcut. I say it out loud mid-session and Claude drops it into the doc: out of my head, into the same file the actual work lives in. And I stop carrying it. I know it’s written down somewhere I’ll see it, somewhere that’s in the pipeline of things that are going to happen, so the part of my brain that was anxiously holding onto it can let go.

These aren’t decisions, and they’re not even questions I’m ready to sit down and answer yet. They’re halfway decisions. Directions. Things far enough out that they’ll probably look different by the time I touch them, and that’s fine. A PF item that’s never urgent just sits there, harmless, no longer renting space in my head.

The protocol is just as simple. Each session opens with “what’s the next open question on the list,” I work through one (sometimes two), and at the end Claude Code appends the resulting decision back to the doc: promoted from Q7 to D7. The question gets struck through but stays in the doc as historical record. Nothing gets re-litigated unless I explicitly reopen it. The parking lot feeds the top of that same funnel: when a PF item finally gets ripe, it graduates into a Q I answer, and the answer becomes a D. PF to Q to D, or it never moves and that works too.

This is the part that does the work that I thought spec-kit would do for me, with 10x less ceremony. Past-me argued the case. Future-me has to honor the decision or explicitly overturn it. Maybe there’s a way to use spec-kit for a project that morphs as it develops, but this works for me.

The Daily Sync: When the Knowledge Base IS the Project

There’s one more thing the planning doc depends on: a sync script.

Every day-ish, I run this from the repo root:

python sync.py

What that script does: it pulls fresh research notes from my Obsidian vault into a research/ folder in the repo, and re-indexes the knowledge base. Takes maybe thirty seconds. The reason I run it daily is that I’m constantly clipping articles, writing fragments, and capturing prompts into Obsidian during the rest of my day. If the repo’s index lags, my planning sessions can’t see what I already know.

The research folder has grown into the most valuable part of the repo. There are sixty-plus clipped articles in there now, organized by topic. There are skill definitions from earlier experiments I want to reference. There are book highlights I exported from Kindle. There are papers.

That’s the corpus the planning sessions reach into when I open a question.

Two MCPs, Two Indexes, Two Different Questions

The repo runs two MCP servers during planning sessions. Both are pointed at the same research/ folder. They index it two different ways.

{
  "mcpServers": {
    "prose-kb": {
      "command": "uv",
      "args": ["run", "--project", "kb", "python", "kb/server/prose_mcp.py"],
      "env": {}
    },
    "graphify": {
      "command": "bash",
      "args": ["kb/graphify_serve.sh"],
      "env": {}
    }
  }
}

The first one is a semantic chunk search. It splits prose into chunks, embeds them, and lets the assistant query “find me passages about X.” If I half-remember reading something months ago about, say, the way constrained generation interacts with structured output, prose-kb is the tool that surfaces the paragraph. It answers the question “where did I write down the thing about this?”

The second one is a knowledge graph. It walks the corpus, pulls out entities and the relationships between them, and clusters them into communities. It exposes tools like get_neighbors, query_graph, shortest_path, and get_community. Where prose-kb is great when I remember reading about something, graphify is great when I don’t know what to ask. You start at a known concept and walk outward. “What’s connected to this idea? What cluster does this belong to? What’s a few hops away?”

They’re not redundant. Chunks and graphs answer different shapes of question, and you can’t fake either with the other.

Here’s the kind of session where I actually need both. Last week I had an open question about how aggressive a particular processing step should be: should it transform aggressively or just trim? I asked graphify for the community of concepts around “post-processing” in the research corpus. That surfaced a cluster of nodes I hadn’t realized were related, including some passages from a book I’d clipped six months ago. Then I asked prose-kb to pull the actual paragraphs from those clipped sources. Two queries, two lenses, one decision recorded back to PLAN.md. Without graphify I wouldn’t have known to ask. Without prose-kb I’d have gotten a summary instead of the actual passages.

When the Assistant Lied to My Face

Now a sidetrack, because these type of things don’t happen as often any more, so worth bringing up.

A few weeks back I noticed there were two graphify-out folders in the repo. One in the project root, one nested inside research/. I asked Claude about it. Got told: “The one in the root is from an earlier configuration. It’s vestigial. The active one is in research/. You can ignore the root one.”

Cool. Moved on. Came back two sessions later. Both folders had fresh data in them. Asked again. Got told again that the root folder was harmless and the active one was the nested one.

Third time I just went and read the shell scripts myself. There was a stale path in one of the KB rebuild scripts. The script was writing to both folders. The “harmless” folder wasn’t harmless; it was getting half my graph data while the MCP server was serving the other half. The fix took ten minutes once I actually looked instead of accepting the second-hand reassurance.

“You told me there was only one Graphify out folder in use. The other was left behind. But something is still writing to both.”

That’s what I typed when I caught it. The lesson is the lesson, and I should already know it: the planning doc and the MCPs are tools. The two MCPs work. The assistant is smart. None of that makes any of them infallible. When something feels off, go look at the actual file. Don’t accept the reassurance, especially the second time you’ve heard it. Especially the third.

And What I’m Building Uses a Completely Different Pattern

Here’s the part I didn’t see coming when I started this.

The repo I just described, the one with prose-kb plus graphify indexing the research corpus, is the planning environment. It’s where the architectural decisions get made. But the thing the project actually produces organizes its data a third way, and the third way is neither of the two MCPs.

Each unit of work the pipeline creates is structured like this:

some-workspace/
├── raw/
│   ├── article-2024-03-12.md
│   ├── article-2024-08-19.md
│   ├── notes.md
│   └── ...
└── wiki/
    ├── overview.md
    ├── style-guide.md
    └── entities/
        ├── concept-a.md
        └── concept-b.md

The raw/ folder is a dump zone. Whatever source material the unit needs lives in there, files intact, not chunked, not embedded. The wiki/ folder is curated structured pages, built by an agent that reads from raw/ and writes to wiki/. The idea comes from Andrej Karpathy’s LLM wiki concept: small, handpicked, agent-maintained knowledge that compounds across sessions.

I deliberately did not mirror the prose-kb-plus-graphify pattern into each workspace. Here’s why.

The corpus inside each workspace is small. We’re talking dozens of files at most, hand-picked, high-signal. Heavy indexing isn’t paying for itself at that scale: the assistant can just read the wiki page.

More importantly, some of the content inside raw/ depends on properties that chunked retrieval destroys. Specifically, properties of the prose itself, like rhythm, cadence, and paragraph structure, that only survive if you load the file whole. Embeddings are great for “find me a thing.” They’re terrible for “preserve the texture of how something is written.” If I’d reused the dual-index pattern inside each workspace, I’d have lost the very thing the workspace exists to capture.

This is the sort of decision that would have come out wrong in either prior attempt. Vibe-coding-me would have used the indexing pattern that was already on the table because it worked at the repo level, and discovered the problem six weeks later when the output was bad. Spec-kit-me would have written a fifteen-page rationale for the choice and forgotten what the original problem was halfway through.

Living-plan-me raised Q5, talked through the trade-offs with the MCPs as backup, and made the call: Claude wrote it up as D5. Maybe ninety minutes from “this is a question” to “this is the answer.” The answer might still be wrong, but it’s explicit and it’s findable and the next time future-me wonders why we did it this way, the doc tells him.

Pick the data shape that matches what you’ll actually do with the data. Don’t reuse a pattern just because it worked somewhere else in the same project. Three different knowledge-organization strategies in one codebase, each chosen for its specific job. None of them is universally right. The fashionable choice is rarely the right one, and the right one is often the one you’d find boring.

What This Process Actually Produces

PLAN.md has been touched sixteen times in the last thirty days. More than any code file in the repo. The plan is the most active artifact.

The doc is the deliverable. Code is a byproduct of decisions being made well. It feels backwards, because vibe coding rewards code-as-output. But it’s the same thing spec-kit was trying to enforce, except the doc is allowed to be uneven and grow and you’re allowed to leave a Q7 open and walk away for three days before answering it.

The decisions stack up. The Q list shrinks. Sometimes a new question pops up because an earlier one got answered in a way that opened it. That’s fine. The doc is allowed to grow.

And I Actually Know What’s In There

Here’s one thing I wasn’t tracking when I started doing this. I understand the codebase.

Not in the “I wrote it last week and it’s fresh” sense. In the “I can tell you why any load-bearing decision is the way it is, and which D# it traces back to” sense. Months in. The Q-to-D-to-code pipeline produces working software as one output and a builder who actually understands his own code as the other.

Compare that to where vibe coding lands you. You get a working thing for a while. You also get a codebase whose decisions you didn’t make explicitly, only half of which the model still remembers, and the model will cheerfully reassure you about all of it.

The interaction model fixes this almost by accident. Every D# in the plan got argued for. I sat in it. The model pushed on a position, I half-agreed, I changed my mind in the next exchange, and then it became a numbered decision. By the time code shows up to enact it, the rationale is already loaded into the part of my brain that has to maintain it.

The one soft spot is when a module has been quiet for a few weeks and I have to go re-touch it. I know I made the decisions, but the texture blurs. So I added learning-opportunities to my global Claude Code skills. When I’m about to change a file I haven’t touched in two weeks, twenty minutes with that skill pointed at it puts me back where I was when the original calls got made. And some days I just point it at what I built that day and let it walk me back through the choices I made: to get a layer of depth out of decisions I’d otherwise just move past.

The Slow Middle Path

Could this approach still fail? Sure. Third attempt could become fourth attempt. Some of the decisions I locked in early might turn out wrong and force a rewrite.

But moving from “stuck for months” to “moving forware” is good, and that’s what I came back for. The first attempt taught me that vibe coding doesn’t work on projects where the architecture matters more than the velocity. The second attempt taught me that spec-kit costs more than it’s worth for one person on one project. The third one is showing me what the middle path actually looks like, and the middle path turns out to be one document, two MCPs, and the discipline to get every decision written down as D7 instead of trying to remember what we decided last Tuesday.

One honest scope note before you run off and try this. The reason it works this well for me is that I’m not building toward a spec somebody handed me: I started this not fully sure what it needed to do by the end, and I still move the destination as I learn. That’s the part I like most: it’s an exploration tool more than a planning tool. Spec-kit assumes you already know what you’re building and the job is to pin it down precisely enough that nobody drifts off it. This is the opposite situation: one person who doesn’t know yet what the thing should be, using the doc to think his way toward it. I wouldn’t run a team’s production roadmap this way; that’s exactly where spec-kit’s ceremony earns its keep. But for figuring out whether a thing should even exist, and what it is once you decide it should? It works great.

Update: Extending the Living Plan Into the Build

A few weeks after I wrote everything above, I did the obvious thing: I pointed the whole approach at a second project. Different domain entirely: a knowledge-graph-based site. Same setup, though: research two inches deep, the same paralysis, the same PLAN.md at the root of the repo with its D# decisions and Q# questions. It worked again.

But this was another project I started on slowly, again because I wasn’t sure where it was going, and it actually had some working code. But the process above ends on a tidy line: the doc is the deliverable, code is a byproduct. Then conveniently stops before answering the question that actually matters: okay, so who writes the code, and how do you hand the plan off without dragging fifteen hundred lines of decisions along for the ride?

I’ll get to how that works. It grew with the process. But first, how I do planning, because plan mode only works for me when I have a complete idea.

Always Opus, Never Plan Mode

I plan in Opus. Only Opus. And Claude Code has an official plan mode that I never touch: not for this, not really for anything open-ended. Both of those come down to one rule: the planning is the conversation, and I won’t put anything between me and the back-and-forth that makes it work.

Start with Opus, because it’s the easy one. This isn’t brand loyalty. I tried to build this exact second project a year ago by opening Claude Code and just telling it to go. It built something. It did not build what I’m building now, not close. The difference isn’t the model’s coding ability. It’s that planning is a thinking activity, and the back-and-forth (me half-forming a position, the model pushing on it, me realizing I was wrong three exchanges in) is the entire mechanism. You don’t get that from a model racing to the answer. You get it from one that’ll sit in the question with you.

And on the Claude Code Pro plan, the math is friendlier than I expected. One planning session ran close to three hours, all in one chat, and I came out of it at 62% of my usage and 21% of context on a million-token window. Three hours of hard thinking for two-thirds of a day’s budget. Opus-for-all-planning is just affordable, so I stopped agonizing about it.

Plan mode is the same rule pointed at the interface instead of the model. It isn’t the back-and-forth I described up top. It tends to collapse the conversation into multiple-choice questions. Pick A, B, or C. And the problem with that, for a decision that’s still half-formed, is that the options are never quite fit the thing in my head. The questions miss points. The choices aren’t narrow enough. I end up arguing with the menu instead of answering it: which means I’m chatting anyway, just chatting against a format that’s fighting me.

I’ve got a clean example from the second project, and it’s a good one because it shows the cost. We were deciding the site’s whole positioning. The assistant, being helpful, served me a multiple-choice. Under that format pressure I picked “aggregator,” because it was the closest box. It wasn’t right. It was just the least-wrong option.

Then I stopped and typed something like: maybe we should just discuss this instead of you handing me multiple-choice. I always get stuck on those because I’m between two of them and it feels like slapping a label on something that doesn’t have one yet. So we talked it through and landed somewhere completely different and obviously better. That’s the decision that got written into the plan. The pick from the multiple-choice would have quietly steered weeks of work in the wrong direction.

The menu makes you commit before you understand. The conversation is the part that does the work and the menu is optimizing away the only step that mattered.

The Second File: TASKS.md

In the original setup there’s one file, PLAN.md, and it does everything. The extension is that when the plan is solid enough to act on, Opus writes a second file: TASKS.md.

PLAN.md is the why. Decisions, rationale, the argument past-me had with himself. Append-only. Numbered. Never re-litigated.
TASKS.md is the what. Active execution state, and nothing else. It gets recreated at build kickoff. I just deleted the old one, because its “what happened and why” job now belongs to PLAN.md.

And the constraint I put on TASKS.md: it has to stand on its own. The header I make Opus write into it says so:

> Self-contained execution file. Each task is written so a doer (e.g. a
> Sonnet or Haiku subagent) can pick it up and execute without loading
> PLAN.md. Every decision-specific fact a doer can't derive on its own is
> inlined here. The trailing (ref: D#) tags point to PLAN.md decisions for
> human / orchestrator traceability only. A doer can ignore them.

So the top of the file is a “Shared facts” block that inlines everything a cold reader would otherwise have to go digging in PLAN.md for: the positioning the whole thing gets judged against, where the code and the database live, the rules every task has to follow. Then each task says what to do, with a little (ref: D5) breadcrumb back to the decision that justified it. The breadcrumb is for me, not for whoever executes the task. They are told to ignore it.

The Handoff Works, and the Plan Never Goes Quiet

When I started this article, I hadn’t tried the two document plan I just described yet. But it worked and it had a bonus.

The split holds exactly like the design said it would. Opus keeps PLAN.md and orchestrates. The build goes to Sonnet in its own context. It only reads TASKS.md, which stands on its own by design, does its file work over in its own window, and hands back a summary. The main Opus thread stays whole and cached, which was the whole reason I went with subagents over clearing context in the first place: clearing context mid-conversation wrecks your prompt caching, but a subagent isn’t cleared context, it’s separate context.

While Sonnet is grinding through a task in the background, I’m not sitting there watching a progress bar. I’m still in the foreground with Opus, working the next open question. The doer builds while the planner keeps planning. Q8 gets answered and written down as D8 in the same stretch of time Sonnet spends turning D5 into actual code.

I’d been treating it as two phases in strict sequence: plan until the plan is solid, then hand it off and build. And the first half of that is still true. On a project with no code yet, you do plan first, because there’s nothing to build from until the decisions exist. What I had wrong was the second half. Once the build starts, the planning doesn’t stop. It runs alongside the build, because the build doesn’t occupy me. It occupies a subagent.

Trust but verify is still the rule and still load-bearing. A summary tells me what Sonnet meant to do, not always what it did, so Opus reads the files the doer actually wrote before anything gets integrated. I’m not skipping that, because I’ve already seen how that movie ends a few sections up.

But the headline holds. PLAN.md is the deliverable, code is the byproduct, and now the byproduct gets built in the background while I stay in the foreground making the decisions that produce it. Three attempts to get here. The first had me building with no plan. The second buried me in a plan I couldn’t build from. The third writes the plan and builds from it at the same time, and I finally get to be the one person in the room whose only job is to think.

But of course, that was not the end. Read more in part 2 of the living plan saga.

Anthropic Shipped Its Smartest Model Yet — and Made It Easier to Hijack

Sat, 30 May 2026 02:00:00 -0500

I updated Claude Code to Opus 4.8 the morning after it dropped, the same way I update everything: without reading the patch notes. Anthropic shipped it on May 28, 41 days after Opus 4.7, which is a fast turnaround when you remember that basically nobody loved 4.7. The launch post is wall-to-wall “reliability” and “honesty.” Sounds great. I’m a sucker for a model that lies to me less.

Then I read the part of the benchmark table that wasn’t in the headline. The prompt-injection number didn’t get better with this “honesty” release. It got worse. So here I am, writing this week’s roundup with the model that just became measurably easier to hijack, feeling real good about my life choices.

That’s the theme this week, honestly. Don’t read the launch posts. Read the benchmarks the launch posts don’t link to. Let’s get into it.

Opus 4.8: The Reliability Upgrade That Got Easier to Hijack
Following the Money: What People Actually Run
The Breakout: A Phone Company Open-Sourced a Frontier Model (+475%)
Hype vs. Value
The Cheapskate Picks
Horror Stories
On the Horizon
What This Week Tells You

Opus 4.8: The Reliability Upgrade That Got Easier to Hijack

Let me be fair before I get snarky, because the model is genuinely good.

Opus 4.8 took the #1 spot on Artificial Analysis’s Intelligence Index at a score of 61, edging out GPT-5.5 in its highest reasoning mode (60). It posts 88.6% on SWE-bench Verified and 69.2% on the harder SWE-bench Pro. On GDPval-AA it hits 1890, up from 1753 on 4.7. On OSWorld-Verified (the computer-use benchmark, the one that actually matters if you’re letting a model click around) it lands 83.4%, a real jump over 4.7.

And here’s the part I actually care about as a daily Claude Code user: it’s roughly four times less likely than 4.7 to let a coding flaw slip through unflagged. That was the whole problem with 4.7. It would confidently barrel ahead with broken code instead of stopping to say “hey, this might be wrong.” 4.8 is the patch for that specific personality defect. Same price, too: $5 per million input, $25 per million output, identical to 4.7, so there’s no migration tax. The new fast mode is $10/$50, and Anthropic claims it’s roughly 2.5× faster than the old fast mode.

Now the snark.

This release is wearing a “honesty and reliability” t-shirt, and underneath it, the Gray Swan prompt-injection number went from 6.0% on 4.7 to 9.6% on 4.8. Higher is worse. If you’re running agentic pipelines over untrusted input — scraping the web, processing user-submitted tickets, anything where the content isn’t yours — the model marketed as the safe one is the one that got more hijackable. That’s the kind of thing that doesn’t make the launch slide.

There’s also the new Dynamic Workflows tool in Claude Code, which decomposes a task into parallel subagents on the fly. Cool feature. Also a feature that “consumes substantially more tokens than typical sessions.” Turn it loose on a big job without a budget and the invoice does its own little dynamic workflow.

One more asterisk: 4.8 is completely absent from the Arena leaderboard right now. Not because it’s bad: because it’s a day old and nobody’s voted on it yet. Arena under-indexes new models hard, so when you see the new hotness missing from the head-to-head rankings, that’s a freshness gap, not a quality verdict. Give it two weeks.

Following the Money: What People Actually Run

If you only look at quality leaderboards, you’d think this is a three-lab race between Anthropic, Google, and OpenAI. Then you look at what people are actually paying to run, and the picture flips.

On OpenRouter this week, the #1 model by token volume is DeepSeek V4 Flash at 3.53 trillion tokens, up 17%. It costs about $0.10 per million in, $0.20 out. That’s not a typo. The most-used model on the platform is a Chinese open-weight model priced like a rounding error.

Claude Opus 4.7 jumped +73% to #3 (2.64T tokens): that’s the launch churn, everybody touching Opus right as 4.8 landed. But here’s the number that actually tells the story: by author, Anthropic holds 18.7% of all OpenRouter traffic and DeepSeek holds 18.0%. They’re neck and neck. The premium lab and the cheap-open lab are splitting the platform down the middle.

And notice what’s not on the Arena overall leaderboard: DeepSeek. At all. Top 25 and it’s not there. That’s not because DeepSeek V4 Flash is bad: it’s the Arena/Reddit-skews-Western blind spot. Flash is a workhorse, not a show pony. The people running it have it wired into a “Pro plans, Flash executes” pipeline: use the bigger model to design the approach, hand the grunt implementation to Flash. One developer’s summary that stuck with me: it “replaced Sonnet 4.6 as my executor — fast, decent results,” with the caveat that it’s “too shallow for complex decisions” and you have to be specific or you get vague output. That’s not a model topping a leaderboard. That’s a model people actually use, quietly, all day.

The Breakout: A Phone Company Open-Sourced a Frontier Model (+475%)

The single biggest mover this week is MiMo-V2.5-Pro, Xiaomi’s model, which jumped +475% to #9 on OpenRouter. The catalyst: Xiaomi open-sourced the weights under MIT. It’s a 1-trillion-parameter mixture-of-experts model with 42B active per pass, a 1M-token context window, priced at $0.43 in / $0.87 out.

Yes, the phone company. The one in your friend’s pocket. It scored 54 on the AA Intelligence Index (tying Kimi K2.6, ahead of GLM-5.1’s 51) while costing 87 cents per million output tokens. That’s a lot of measured intelligence per dollar; it’s smarter than its price tag has any right to be. It’s under-voted on Arena because, again, the enthusiast crowd skews toward the Western labs, so the quiet around it isn’t “meh,” it’s a measurement gap.

Hold onto MiMo. It’s about to win two categories outright.

Hype vs. Value

Quick gut-check on what’s overcooked and what’s underrated this week.

Probably hype (for now):

Opus 4.8 — I know, I just spent a whole section praising it. It’s real. But it’s one day old, absent from Arena, and shipping a worse adversarial-robustness number under a “safety” banner. The launch-day glow is doing a lot of work. Respect the brain, verify before you trust it in a loop.
Owl Alpha — the stealth model on OpenRouter, still sitting at #5 (1.38T tokens) and still nobody’s confirmed who built it. It’s been ~31 days. Remember when stealth models got unmasked in two weeks? Polaris turned out to be GPT-5.1, Sherlock turned out to be Grok 4.1, both inside a fortnight. That clock is dead now. Owl’s free, it’s got a 1M context, it’s tuned for agentic work: and reviewers keep running into a “speed tax,” where it’s capable but slow. Also, free means the provider logs all your prompts to improve the model. Free is never free.

Under-sold value:

MiMo-V2.5-Pro — covered above. AA index 54 at 87 cents, open weights, +475% real usage, barely a whisper on Arena.
DeepSeek V4 Flash — #1 by volume for weeks, near-invisible on the preference leaderboards, because the people who depend on it are shipping, not posting hot takes.

The Cheapskate Picks

This is the part I actually do the math for, because it’s the part that saves you money.

Here’s the thing about the Arena leaderboard nobody says out loud: the top is compressed. In the overall category, the #1 model sits at 1502 and #25 sits at 1466. That’s a 36-point spread across the entire visible top end of a ~1400-point scale. Which means the “best” model is often only marginally ahead of something 8 to 30 times cheaper. So the move is: anchor on the category leader’s rating, draw a band 50 points down from it, and pick the cheapest model still inside that band. You give up a rounding error of quality and you keep most of your money.

Here’s how that shook out this week. (Prices are OpenRouter output dollars per million tokens.)

Category	Leader	$ leader	Cheapskate pick	$ pick	Δ rating	Price ratio	Pick’s AA Index
Overall	Opus 4.6-thinking	$25	Gemini 3 Flash	$3	−29	~8×	—
Coding	Opus 4.7-thinking	$25	GLM-5.1	$3.08	−28	~8×	51
Creative Writing	Opus 4.6-thinking	$25	Gemini 3 Flash	$3	−38	~8×	—
Instruction Following	Opus 4.6-thinking	$25	MiMo-V2.5-Pro	$0.87	−41	~29×	54
Hard Prompts	Opus 4.6-thinking	$25	GLM-5.1	$3.08	−34	~8×	51
Math	Gemini 3.5 Flash	$9	MiMo-V2.5-Pro	$0.87	−38	~10×	54

A few things jump out.

MiMo is the MVP. It wins Instruction Following and Math outright on a value basis, and the Instruction Following trade is the best deal in the whole roundup: you’re 29× cheaper on output for a 41-point rating gap, which is under 3% of the scale. And MiMo isn’t just an Arena artifact: it also scores 54 on AA’s Intelligence Index, which is built on hard benchmarks and ignores crowd preference entirely. Two completely different measurement styles (crowd votes on Arena, objective benchmarks on AA) landing on the same cheap model is about as high-confidence as a recommendation gets.

GLM-5.1 owns the coding-flavored categories. It’s within 28 points of the best coding model on the board for about 8× less ($0.98 in / $3.08 out). One caveat worth knowing: it’s a 203K context window, not the 1M the flagships give you. If you’re feeding it a giant monorepo, that matters.

Gemini 3 Flash holds the generalist slots (Overall and Creative Writing) at $0.50 in / $3 out. In the Overall band it actually out-ranks Claude Sonnet 4.6, so the cheap option is beating the mid-tier from a pricier lab.

And the weird one: in Math, even the leader is cheap. Gemini 3.5 Flash, a value-tier model at $9 output, is the outright #1 in the Math category. So if you want the absolute top math model, you’re not even paying flagship prices. The cheapskate floor below it is MiMo at 10× less. Math is the rare category where there’s just no reason to reach for a $25 flagship at all.

Every category this week had a sub-$3.10 option inside the competitive band. There was no “you’re just paying for quality here” category, which doesn’t always happen. Good week to be cheap.

Horror Stories

Every roundup needs its hall of shame. This week’s mostly comes from the launch I opened with.

The reliability release that got easier to hijack. Opus 4.8’s headline is honesty, but its Gray Swan prompt-injection success rate climbed from 6.0% (4.7) to 9.6% (4.8). If your agent reads untrusted input, the “safer” model is the more hijackable one. Read the adversarial benchmarks, not the press release.
Dynamic Workflows, dynamic bill. The shiny new parallel-subagent decomposer in Claude Code burns substantially more tokens than a normal session. Powerful, but budget it before you scale it, or the feature optimizes your spend in the wrong direction.
Owl Alpha’s speed tax. Free, 1M context, agentic-tuned, ~31 days into a stealth run with no provider reveal; and reviewers keep hitting slow throughput, while the provider quietly logs every prompt you send. Nothing free is free; sometimes you pay in latency and data.

On the Horizon

What’s coming, with the appropriate amount of salt:

Gemini 3.5 Pro / Gemini 3.2 — rumored, June. Google ships on a quarterly cadence and 3.5 Flash already landed May 19, with the Pro tier apparently slipping. Treat the date as a pattern guess, not a promise.
Grok 5 (xAI) — announced, in training. Reportedly 6 trillion parameters, training on the Colossus 2 supercluster (1GW scaling toward 1.5GW), which would make it the largest publicly disclosed model. Q2 target.
Claude Mythos — restricted preview. Anthropic’s high-ceiling model, limited to Project Glasswing partners for defensive cybersecurity, with eye-watering reported benchmark numbers. The one to watch, if you can get near it.
GPT-6 — speculation. Codename chatter, late-2026 expectations. Nothing solid.
Step 3.7 Flash and Grok Build 0.1 — live now on OpenRouter (showed up around May 20), not yet ranking. Worth a look if you collect models like I do.

What This Week Tells You

No inspiration porn, just the honest read.

The gap between what tops the leaderboard and what you should actually pay to run has never been wider. Opus 4.8 is, genuinely, the best brain on the board right now: and also a thing you shouldn’t hand untrusted input without thinking about it first. Meanwhile a phone company is giving away a model that wins two value categories outright, and the most-used model on OpenRouter costs ten cents a million tokens.

If you take one thing from this week: stop defaulting to the flagship for everything. Anchor on the leader, find the cheapest thing 50 rating points behind it, and pocket the 8-to-29× difference for the jobs that don’t need a genius. Save the flagship for the work that actually does. And even then, check what it does when somebody feeds it something nasty.

Stephan Miller

The Living Plan Got Fat: Compacting a Doc That Won't Stop Growing

The Good Problem

Progressive Disclosure, Which I Was Already Doing Everywhere Else

What “Cooled” Actually Means

The Payoff: 28,357 Words Down to 8,332

It’s a Process

Why It Became a Skill: The Other Project Running This

The Skill Is a Router, Not a Script

The Honest Scope Note

What’s Next

Model Buzz Roundup — Week of June 24, 2026

The Gate Spread to the Whole Frontier

Export Controls Failed Their First Real Test This Week

The Model Nobody Can Switch Off (Still GLM-5.2)

Cheapskate Picks: Best You Can Actually Run

Horror Stories from the Wild

Where This Leaves You

The Obsidian Plugin Collection I Built One Free Kiro Credit at a Time

The Free Tier Economy of AI Coding Tools

The Monthly Grind

The Plugin List

Apple Books Annotation Import: the one that I started with

Joplin Portal: the one Kiro broke, then fixed

Obsidian Cleaner: the one that suffered scope creep

Daily Prompts: the one that’s almost there

Notebook OCR: the one that worked well and I stopped using

YouTube Auto Video Summarizer: the one I forked

Tag Explorer 3D: the one that doesn’t have a git repo yet

What the Process Actually Looks Like (a Kiro Horror Story)

The Real ROI

Model Buzz Roundup — Week of June 17, 2026

The Week the Government Unplugged the Number One Model

The Vaporware Twins: GPT-5.6 and Gemini 3.5 Pro

The Model Nobody Can Switch Off

Cheapskate Picks: Best You Can Actually Run

Horror Stories from the Wild

Where This Leaves You

The Agent Skills Guide I Wish I'd Had

What a Skill Actually Is (and the Three Things It Isn’t)

The Mistake Everyone Makes First: Treating It Like a Shell Script

The Part That Saves You Tokens: How Claude Code Loads Skills

Building Your First Skill in Claude Code

The description is the hardest line you’ll write

Let Claude write it, then cut hard

The gotchas section is the whole game

The Folder Is the Feature

A Claude Code-specific trick: skill-scoped hooks

Skills Actually Worth Building

The Other Guys: Skills Everywhere Else

The open-source agents

Codex CLI (OpenAI)

Cursor

GitHub Copilot

The commercial top three

The skill cheat sheet

How Good Skills Actually Get Built

Organize Before It Becomes a Swamp

The Skills I Actually Reach For

Lessons I Had to Learn the Hard Way

Don’t trust a skill’s first trial. Build a way to catch the ones that rot

Revisit your descriptions. Treat global ones completely differently

The Honest Version

Model Buzz Roundup: Week of June 10, 2026

Table of Contents

The Beast Arrives

…And Then the Wheels Came Off

And Then the Government Pulled the Plug

The $50 Question

Meanwhile, in Cheapskate Land

The Usage Chart Disagrees With Everyone

Coming Soon

The Takeaway

I Built an Obsidian OCR Plugin for My Notebooks, Then Started Talking to OpenClaw Instead

The Hardware Setup Nobody Asked For

Spec Mode With Kiro: Let AI Design the Damn Thing

The OCR Backend

The Rule Engine: When Your Handwriting Has Structure

The Physical Reality

The Part Where the Plugin Wins and I Stopped Using It Anyway