Stephan Miller
My Third Try: How a Living Plan Beat Both Vibe Coding and Spec-Kit

I’ve been thinking about building a project for a while now. Months, actually. Why? Because I already tried to build it a couple of times. The first attempt was pure vibe coding and produced an unarchitected behemoth that veered off-target around week two, but I kept working on it and gave up around week three. The second try was with GitHub’s spec-kit, and I drowned in paperwork before any code ran.

It’s not that the project is important to anyone other than me, but I want it to be right. So in the meantime, I turned parts and pieces of it into skills for Claude Code. And after I realized I was just orchestrating all these skills myself, I started on the project again.

The third attempt is working. The trick is dumber than I want to admit. But it’s the only thing I could think of when I only halfway knew what I wanted it to do.

The Two Failed Attempts (And Why Each Sucked)

I’ve written before about both of these in pieces. The great vibe coding experiment covers the part where I leaned all the way into “just build shit and see what happens,” and a few months later I wrote about the part where I burned out on vibe coding, came back, and rewrote everything. Together those two posts are roughly the full arc of “what happens when you treat every project like it’s a weekend hack.”

Attempt One: Pure Vibe

The first run at this project started the way most of my projects start: I opened Claude Code, described the idea in a few paragraphs, and said “let’s build a thin slice.” That works great when you’re building an Obsidian plugin or a CLI tool with one job. It does not work when the project has four loosely-coupled modules that all have to share the same data shape and you don’t yet know what that shape should be.

What I got, about three weeks in, was a sprawling codebase that did approximately one-third of what I wanted, in a way that made the other two-thirds impossible without ripping out the foundation. I should know to expect this by now and I do. But just one more roll of the dice.

On this project, the architecture decisions mattered more than the velocity. Vibe coding optimizes for “let’s see where we end up.” This project needed “let’s make sure we end up in the right place.”

I abandoned it. Not the idea, the codebase. The idea kept coming back.

Attempt Two: Spec-Kit and the Waterfall Trap

A few months later, GitHub had released spec-kit and the discourse had moved on to “spec-driven development.” So I tried it. The pitch is reasonable: write a spec, generate a plan, generate tasks, then build. Front-load the thinking. Don’t let the AI go off the rails.

I wrote the spec. I generated the plan. I generated the tasks. I had a beautiful tree of structured documents that described what I was going to build in extensive detail, even though I was guessing.

And then I sat there. Because the spec-kit output was sized for a project with stakeholders. It’s a stakeholder-management tool dressed up as a planning tool. I am one person. There is no stakeholder. The doc had nobody to satisfy except me, and I kept moving the goalposts on myself. Every section invited another section. Every requirement spawned three sub-requirements. By the time the plan looked “done,” I was tired of the project before I’d written a line of code. And I really was not sure if it was what I wanted, but I didn’t want to change it, because after all the specs were in, it would be like trying to do a 180 in an ocean liner.

Spec-kit is probably the right move if you’re working on a regulated codebase with a real product manager and a real backlog of stakeholder asks. For what I work on in my free time, the overhead eats the energy that’s supposed to fuel the build.

I closed the spec-kit folder. Two attempts down. The idea wouldn’t leave though.

The Living Plan: One Document, Numbered Decisions

Here’s what worked. It is boring and relatively dumb.

I made one file at the root of the repo. I called it PLAN.md. The first thing that went into it was this:

# Living Planning Doc

> Living document. Built up across multiple planning sessions. Not a finalized
> spec, just a running record of decisions made, context that matters, and open
> questions still to resolve. Append, refine, don't rewrite.

## Why this document exists

I've spent months stuck on this project because the research is too large
to consolidate alone. The goal of these planning sessions is to get the idea
right before building anything, not to optimize for build speed. This file
is the persistent memory that survives between sessions so prior decisions
don't get re-litigated.

That opening section was the thing that unblocked me. Naming the fear out loud turned out to be more useful than any of the architecture decisions that came after it. Because I’m an idiot, I’d been treating planning as something separate from building, instead of admitting that the planning was the part I was actually failing at.

Below that header, the doc has three kinds of entries. Decisions numbered D1, D2, D3. Open Questions numbered Q1, Q2, Q3. And a Parking Lot of items numbered PF1, PF2, PF3: the stuff I’m not working on right now but that popped up during a session and I don’t want to forget.

Here’s what an entry looks like:

### D5: Per-workspace knowledge store

Each unit uses a two-folder pattern inspired by Karpathy's LLM wiki idea:

- `raw/`: immutable dump zone. Source material stays whole. No chunking,
  no preprocessing. Some of this content depends on properties that
  chunked retrieval destroys, so it has to be loaded whole.
- `wiki/`: agent-maintained structured entity pages built from `raw/`.
  Knowledge compounds across sessions instead of being re-derived from raw
  sources every time.

This replaces an earlier sketch that mirrored the repo's dual-layer KB into
each unit. Those indexes work for the repo's unstructured research, but they
mismatch the per-unit data shape (small, curated, handpicked).

And here’s what an open question looks like before it’s answered:

### Q7: Round-trip, where does the human-edited final version live?

AI-generated drafts land in `runs/`. Human-edited finals need to land back
in `raw/` so the next iteration can learn from them. Open questions:

- Is the front-matter contract honored on output, or applied on ingestion?
- How do we mark a file as "AI-touched" vs "human-only" without forcing
  manual labeling of legacy material?
- Do we need a separate retrospective stage?

And here’s the third kind: a parking-lot item. This is the one I almost left out of this post, because it came later:

### PF4: Per-client style overrides that learn from my edits

Not building this now. Far enough out that the shape will probably
change before I get there. But: when I hand-edit a generated draft,
the edits are signal. Eventually the per-client rules should learn
from the round-trip, catching the same gotcha next time instead of me
fixing it by hand every run. Direction, not a decision.

Here’s why that section exists: I get worried that if I don’t write an idea down the second I have it, I’ll lose it. Not “might.” Will. It happens to me constantly. So I could have done the roundabout thing I used to build the project’s knowledge base: dump the idea into my Obsidian vault and let the sync script drag it over into the repo eventually, where it’d surface in some future session. That works, but it’s a long way around for “don’t forget this.”

The parking lot is the shortcut. I say it out loud mid-session and Claude drops it into the doc: out of my head, into the same file the actual work lives in. And I stop carrying it. I know it’s written down somewhere I’ll see it, somewhere that’s in the pipeline of things that are going to happen, so the part of my brain that was anxiously holding onto it can let go.

These aren’t decisions, and they’re not even questions I’m ready to sit down and answer yet. They’re halfway decisions. Directions. Things far enough out that they’ll probably look different by the time I touch them, and that’s fine. A PF item that’s never urgent just sits there, harmless, no longer renting space in my head.

The protocol is just as simple. Each session opens with “what’s the next open question on the list,” I work through one (sometimes two), and at the end Claude Code appends the resulting decision back to the doc: promoted from Q7 to D7. The question gets struck through but stays in the doc as historical record. Nothing gets re-litigated unless I explicitly reopen it. The parking lot feeds the top of that same funnel: when a PF item finally gets ripe, it graduates into a Q I answer, and the answer becomes a D. PF to Q to D, or it never moves and that works too.

This is the part that does the work that I thought spec-kit would do for me, with 10x less ceremony. Past-me argued the case. Future-me has to honor the decision or explicitly overturn it. Maybe there’s a way to use spec-kit for a project that morphs as it develops, but this works for me.
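To make the Q-to-D promotion concrete, here is a minimal Python sketch of the bookkeeping. It is an illustration only: in the process described above, Claude Code makes this edit during the session, not a script. The sketch assumes headings follow the `### Q7: title` shape shown in the examples. The `~~` strikethrough markup, the "(answered, see D7)" suffix, and the function names `list_entries` and `promote` are all hypothetical.

```python
import re

# Assumed heading shape, taken from the excerpts above: "### D5: Some title"
HEADING = re.compile(r"^### (D|Q|PF)(\d+): (.+)$")


def list_entries(plan_text):
    """Return (kind, number, title) for every live numbered heading in the plan."""
    entries = []
    for line in plan_text.splitlines():
        match = HEADING.match(line)
        if match:
            entries.append((match.group(1), int(match.group(2)), match.group(3)))
    return entries


def promote(plan_text, number, decision_body):
    """Strike through Q<number> and append a matching D<number> entry.

    The question stays in the document as history; only its heading changes.
    """
    lines = plan_text.splitlines()
    for i, line in enumerate(lines):
        match = HEADING.match(line)
        if match and match.group(1) == "Q" and int(match.group(2)) == number:
            title = match.group(3)
            # Hypothetical convention: strike the heading and point at the decision.
            lines[i] = f"### ~~Q{number}: {title}~~ (answered, see D{number})"
            break
    else:
        raise ValueError(f"Q{number} is not an open question in this plan")
    lines += ["", f"### D{number}: {title}", "", decision_body.strip()]
    return "\n".join(lines) + "\n"
```

Because the struck-through heading no longer matches the pattern, `list_entries` only reports questions that are still open, which is the list a session would start from.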

The Daily Sync: When the Knowledge Base IS the Project

There’s one more thing the planning doc depends on: a sync script.

Every day-ish, I run this from the repo root:

python sync.py

What that script does: it pulls fresh research notes from my Obsidian vault into a research/ folder in the repo, and re-indexes the knowledge base. Takes maybe thirty seconds. The reason I run it daily is that I’m constantly clipping articles, writing fragments, and capturing prompts into Obsidian during the rest of my day. If the repo’s index lags, my planning sessions can’t see what I already know.
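The real sync.py isn’t shown here, so the following is only a rough sketch of the copy half of such a script, under stated assumptions: notes are Markdown files, the vault and research/ locations are passed in as arguments, and a note is re-copied only when the vault version is newer. The function name `sync_notes` is hypothetical, and the re-indexing step is left out entirely because nothing above describes how it works.

```python
import shutil
from pathlib import Path


def sync_notes(vault_dir, research_dir):
    """Copy new or changed Markdown notes from the vault into research/.

    Returns the list of destination paths that were written.
    """
    vault_dir, research_dir = Path(vault_dir), Path(research_dir)
    copied = []
    for source in sorted(vault_dir.rglob("*.md")):
        target = research_dir / source.relative_to(vault_dir)
        # Skip notes whose copy is already at least as new as the original.
        if target.exists() and target.stat().st_mtime >= source.stat().st_mtime:
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also carries over the modification time
        copied.append(target)
    return copied
```

Running it twice in a row should copy nothing the second time, which is what keeps a daily run cheap.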

The research folder has grown into the most valuable part of the repo. There are sixty-plus clipped articles in there now, organized by topic. There are skill definitions from earlier experiments I want to reference. There are book highlights I exported from Kindle. There are papers.

That’s the corpus the planning sessions reach into when I open a question.

Two MCPs, Two Indexes, Two Different Questions

The repo runs two MCP servers during planning sessions. Both are pointed at the same research/ folder. They index it two different ways.

{
  "mcpServers": {
    "prose-kb": {
      "command": "uv",
      "args": ["run", "--project", "kb", "python", "kb/server/prose_mcp.py"],
      "env": {}
    },
    "graphify": {
      "command": "bash",
      "args": ["kb/graphify_serve.sh"],
      "env": {}
    }
  }
}

The first one is a semantic chunk search. It splits prose into chunks, embeds them, and lets the assistant query “find me passages about X.” If I half-remember reading something months ago about, say, the way constrained generation interacts with structured output, prose-kb is the tool that surfaces the paragraph. It answers the question “where did I write down the thing about this?”

The second one is a knowledge graph. It walks the corpus, pulls out entities and the relationships between them, and clusters them into communities. It exposes tools like get_neighbors, query_graph, shortest_path, and get_community. Where prose-kb is great when I remember reading about something, graphify is great when I don’t know what to ask. You start at a known concept and walk outward. “What’s connected to this idea? What cluster does this belong to? What’s a few hops away?”

They’re not redundant. Chunks and graphs answer different shapes of question, and you can’t fake either with the other.
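A toy Python sketch can show why the two indexes answer different shapes of question. Everything here is an assumption made for illustration: the sample chunks and graph are invented, word overlap stands in for real embedding similarity, and `get_neighbors` and `shortest_path` only borrow the tool names mentioned above. This is not how prose-kb or graphify are implemented.

```python
import re
from collections import deque

# Invented sample data, not from the real corpus.
CHUNKS = [
    "Constrained generation forces the model to emit valid structured output.",
    "Post-processing can trim a draft or transform it aggressively.",
    "A knowledge graph clusters entities into communities.",
]
GRAPH = {
    "post-processing": {"trimming", "style transfer"},
    "trimming": {"post-processing"},
    "style transfer": {"post-processing", "cadence"},
    "cadence": {"style transfer"},
}


def _words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_chunks(chunks, query, top_k=1):
    """Rank chunks by shared words (a crude stand-in for embedding similarity)."""
    scored = [(len(_words(query) & _words(chunk)), chunk) for chunk in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def get_neighbors(graph, node):
    """Return the concepts directly linked to `node`."""
    return sorted(graph.get(node, set()))


def shortest_path(graph, start, goal):
    """Breadth-first search for the fewest hops between two concepts."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], set())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

The chunk search needs you to already have words for what you want. The graph walk only needs a starting node, which is why it helps when you don’t know what to ask.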

Here’s the kind of session where I actually need both. Last week I had an open question about how aggressive a particular processing step should be: should it transform aggressively or just trim? I asked graphify for the community of concepts around “post-processing” in the research corpus. That surfaced a cluster of nodes I hadn’t realized were related, including some passages from a book I’d clipped six months ago. Then I asked prose-kb to pull the actual paragraphs from those clipped sources. Two queries, two lenses, one decision recorded back to PLAN.md. Without graphify I wouldn’t have known to ask. Without prose-kb I’d have gotten a summary instead of the actual passages.

When the Assistant Lied to My Face

Now a sidetrack, because this type of thing doesn’t happen as often anymore, so it’s worth bringing up.

A few weeks back I noticed there were two graphify-out folders in the repo. One in the project root, one nested inside research/. I asked Claude about it. Got told: “The one in the root is from an earlier configuration. It’s vestigial. The active one is in research/. You can ignore the root one.”

Cool. Moved on. Came back two sessions later. Both folders had fresh data in them. Asked again. Got told again that the root folder was harmless and the active one was the nested one.

Third time I just went and read the shell scripts myself. There was a stale path in one of the KB rebuild scripts. The script was writing to both folders. The “harmless” folder wasn’t harmless; it was getting half my graph data while the MCP server was serving the other half. The fix took ten minutes once I actually looked instead of accepting the second-hand reassurance.

“You told me there was only one Graphify out folder in use. The other was left behind. But something is still writing to both.”

That’s what I typed when I caught it. The lesson is the lesson, and I should already know it: the planning doc and the MCPs are tools. The two MCPs work. The assistant is smart. None of that makes any of them infallible. When something feels off, go look at the actual file. Don’t accept the reassurance, especially the second time you’ve heard it. Especially the third.
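“Go look at the actual file” can be partly mechanized. The sketch below is one hedged way to do it, not the check that was actually run: it finds every folder with a given name under the repo and reports which ones contain a recently modified file. The folder name `graphify-out` comes from the story above; the function names and the one-day freshness window are assumptions.

```python
import time
from pathlib import Path


def find_output_dirs(repo_root, name="graphify-out"):
    """Return every directory called `name` under the repo, nested or not."""
    root = Path(repo_root)
    return sorted(p for p in root.rglob(name) if p.is_dir())


def newest_write(directory):
    """Return the most recent modification time of any file in `directory`."""
    times = [p.stat().st_mtime for p in Path(directory).rglob("*") if p.is_file()]
    return max(times) if times else None


def recently_written(repo_root, name="graphify-out", max_age_seconds=86400, now=None):
    """List output folders that received a write within the freshness window."""
    now = time.time() if now is None else now
    fresh = []
    for directory in find_output_dirs(repo_root, name):
        latest = newest_write(directory)
        if latest is not None and now - latest <= max_age_seconds:
            fresh.append(directory)
    return fresh
```

If a folder that is supposed to be vestigial shows up in `recently_written`, something is still writing to it, whatever the reassurance says.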

And What I’m Building Uses a Completely Different Pattern

Here’s the part I didn’t see coming when I started this.

The repo I just described, the one with prose-kb plus graphify indexing the research corpus, is the planning environment. It’s where the architectural decisions get made. But the thing the project actually produces organizes its data a third way, and the third way is neither of the two MCPs.

Each unit of work the pipeline creates is structured like this:

some-workspace/
├── raw/
│   ├── article-2024-03-12.md
│   ├── article-2024-08-19.md
│   ├── notes.md
│   └── ...
└── wiki/
    ├── overview.md
    ├── style-guide.md
    └── entities/
        ├── concept-a.md
        └── concept-b.md

The raw/ folder is a dump zone. Whatever source material the unit needs lives in there, files intact, not chunked, not embedded. The wiki/ folder is curated structured pages, built by an agent that reads from raw/ and writes to wiki/. The idea comes from Andrej Karpathy’s LLM wiki concept: small, handpicked, agent-maintained knowledge that compounds across sessions.

I deliberately did not mirror the prose-kb-plus-graphify pattern into each workspace. Here’s why.

The corpus inside each workspace is small. We’re talking dozens of files at most, hand-picked, high-signal. Heavy indexing isn’t paying for itself at that scale: the assistant can just read the wiki page.

More importantly, some of the content inside raw/ depends on properties that chunked retrieval destroys. Specifically, properties of the prose itself, like rhythm, cadence, and paragraph structure, that only survive if you load the file whole. Embeddings are great for “find me a thing.” They’re terrible for “preserve the texture of how something is written.” If I’d reused the dual-index pattern inside each workspace, I’d have lost the very thing the workspace exists to capture.
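A small Python sketch of the difference, with assumptions flagged: fixed-size character chunking is used here as the simplest possible chunker (real retrieval pipelines split differently), and the function names are hypothetical. The point is only that a chunk boundary has no reason to line up with a paragraph break, while a whole-file load keeps every break where the writer put it.

```python
from pathlib import Path


def load_whole(raw_dir):
    """Read every Markdown file in raw/ intact, keyed by file name."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(raw_dir).glob("*.md"))
    }


def chunk_fixed(text, size):
    """Split text into fixed-size character chunks, ignoring paragraph breaks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraphs(text):
    """Split on blank lines, which is where the writer's own breaks are."""
    return [part for part in text.split("\n\n") if part.strip()]
```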

This is the sort of decision that would have come out wrong in either prior attempt. Vibe-coding-me would have used the indexing pattern that was already on the table because it worked at the repo level, and discovered the problem six weeks later when the output was bad. Spec-kit-me would have written a fifteen-page rationale for the choice and forgotten what the original problem was halfway through.

Living-plan-me raised Q5, talked through the trade-offs with the MCPs as backup, and made the call: Claude wrote it up as D5. Maybe ninety minutes from “this is a question” to “this is the answer.” The answer might still be wrong, but it’s explicit and it’s findable and the next time future-me wonders why we did it this way, the doc tells him.

Pick the data shape that matches what you’ll actually do with the data. Don’t reuse a pattern just because it worked somewhere else in the same project. Three different knowledge-organization strategies in one codebase, each chosen for its specific job. None of them is universally right. The fashionable choice is rarely the right one, and the right one is often the one you’d find boring.

What This Process Actually Produces

PLAN.md has been touched sixteen times in the last thirty days. More than any code file in the repo. The plan is the most active artifact.

The doc is the deliverable. Code is a byproduct of decisions being made well. It feels backwards, because vibe coding rewards code-as-output. But it’s the same thing spec-kit was trying to enforce, except the doc is allowed to be uneven and grow and you’re allowed to leave a Q7 open and walk away for three days before answering it.

The decisions stack up. The Q list shrinks. Sometimes a new question pops up because an earlier one got answered in a way that opened it. That’s fine. The doc is allowed to grow.

And I Actually Know What’s In There

Here’s one thing I wasn’t tracking when I started doing this. I understand the codebase.

Not in the “I wrote it last week and it’s fresh” sense. In the “I can tell you why any load-bearing decision is the way it is, and which D# it traces back to” sense. Months in. The Q-to-D-to-code pipeline produces working software as one output and a builder who actually understands his own code as the other.

Compare that to where vibe coding lands you. You get a working thing for a while. You also get a codebase whose decisions you didn’t make explicitly, only half of which the model still remembers, and the model will cheerfully reassure you about all of it.

The interaction model fixes this almost by accident. Every D# in the plan got argued for. I sat in it. The model pushed on a position, I half-agreed, I changed my mind in the next exchange, and then it became a numbered decision. By the time code shows up to enact it, the rationale is already loaded into the part of my brain that has to maintain it.

The one soft spot is when a module has been quiet for a few weeks and I have to go re-touch it. I know I made the decisions, but the texture blurs. So I added learning-opportunities to my global Claude Code skills. When I’m about to change a file I haven’t touched in two weeks, twenty minutes with that skill pointed at it puts me back where I was when the original calls got made. And some days I just point it at what I built that day and let it walk me back through the choices I made, to get a layer of depth out of decisions I’d otherwise just move past.

The Slow Middle Path

Could this approach still fail? Sure. Third attempt could become fourth attempt. Some of the decisions I locked in early might turn out wrong and force a rewrite.

But moving from “stuck for months” to “moving forward” is good, and that’s what I came back for. The first attempt taught me that vibe coding doesn’t work on projects where the architecture matters more than the velocity. The second attempt taught me that spec-kit costs more than it’s worth for one person on one project. The third one is showing me what the middle path actually looks like, and the middle path turns out to be one document, two MCPs, and the discipline to get every decision written down as D7 instead of trying to remember what we decided last Tuesday.

One honest scope note before you run off and try this. The reason it works this well for me is that I’m not building toward a spec somebody handed me: I started this not fully sure what it needed to do by the end, and I still move the destination as I learn. That’s the part I like most: it’s an exploration tool more than a planning tool. Spec-kit assumes you already know what you’re building and the job is to pin it down precisely enough that nobody drifts off it. This is the opposite situation: one person who doesn’t know yet what the thing should be, using the doc to think his way toward it. I wouldn’t run a team’s production roadmap this way; that’s exactly where spec-kit’s ceremony earns its keep. But for figuring out whether a thing should even exist, and what it is once you decide it should? It works great.

Update: Extending the Living Plan Into the Build

A few weeks after I wrote everything above, I did the obvious thing: I pointed the whole approach at a second project. Different domain entirely: a knowledge-graph-based site. Same setup, though: research two inches deep, the same paralysis, the same PLAN.md at the root of the repo with its D# decisions and Q# questions. It worked again.

This was another project I’d started on slowly, again because I wasn’t sure where it was going, though this one already had some working code. And the process above ends on a tidy line: the doc is the deliverable, code is a byproduct. Then it conveniently stops before answering the question that actually matters: okay, so who writes the code, and how do you hand the plan off without dragging fifteen hundred lines of decisions along for the ride?

I’ll get to how that works. It grew with the process. But first, how I do planning, because plan mode only works for me when I have a complete idea.

Always Opus, Never Plan Mode

I plan in Opus. Only Opus. And Claude Code has an official plan mode that I never touch: not for this, not really for anything open-ended. Both of those come down to one rule: the planning is the conversation, and I won’t put anything between me and the back-and-forth that makes it work.

Start with Opus, because it’s the easy one. This isn’t brand loyalty. I tried to build this exact second project a year ago by opening Claude Code and just telling it to go. It built something. It did not build what I’m building now, not close. The difference isn’t the model’s coding ability. It’s that planning is a thinking activity, and the back-and-forth (me half-forming a position, the model pushing on it, me realizing I was wrong three exchanges in) is the entire mechanism. You don’t get that from a model racing to the answer. You get it from one that’ll sit in the question with you.

And on the Claude Code Pro plan, the math is friendlier than I expected. One planning session ran close to three hours, all in one chat, and I came out of it at 62% of my usage and 21% of context on a million-token window. Three hours of hard thinking for two-thirds of a day’s budget. Opus-for-all-planning is just affordable, so I stopped agonizing about it.

Plan mode is the same rule pointed at the interface instead of the model. It isn’t the back-and-forth I described up top. It tends to collapse the conversation into multiple-choice questions. Pick A, B, or C. And the problem with that, for a decision that’s still half-formed, is that the options never quite fit the thing in my head. The questions miss points. The choices aren’t narrow enough. I end up arguing with the menu instead of answering it, which means I’m chatting anyway, just chatting against a format that’s fighting me.

I’ve got a clean example from the second project, and it’s a good one because it shows the cost. We were deciding the site’s whole positioning. The assistant, being helpful, served me a multiple-choice. Under that format pressure I picked “aggregator,” because it was the closest box. It wasn’t right. It was just the least-wrong option.

Then I stopped and typed something like: maybe we should just discuss this instead of you handing me multiple-choice. I always get stuck on those because I’m between two of them and it feels like slapping a label on something that doesn’t have one yet. So we talked it through and landed somewhere completely different and obviously better. That’s the decision that got written into the plan. The pick from the multiple-choice would have quietly steered weeks of work in the wrong direction.

The menu makes you commit before you understand. The conversation is the part that does the work and the menu is optimizing away the only step that mattered.

The Second File: TASKS.md

In the original setup there’s one file, PLAN.md, and it does everything. The extension is that when the plan is solid enough to act on, Opus writes a second file: TASKS.md.

  • PLAN.md is the why. Decisions, rationale, the argument past-me had with himself. Append-only. Numbered. Never re-litigated unless I explicitly reopen something.
  • TASKS.md is the what. Active execution state, and nothing else. It gets recreated at build kickoff. I just deleted the old one, because its “what happened and why” job now belongs to PLAN.md.

And the constraint I put on TASKS.md: it has to stand on its own. The header I make Opus write into it says so:

> Self-contained execution file. Each task is written so a doer (e.g. a
> Sonnet or Haiku subagent) can pick it up and execute without loading
> PLAN.md. Every decision-specific fact a doer can't derive on its own is
> inlined here. The trailing (ref: D#) tags point to PLAN.md decisions for
> human / orchestrator traceability only. A doer can ignore them.

So the top of the file is a “Shared facts” block that inlines everything a cold reader would otherwise have to go digging in PLAN.md for: the positioning the whole thing gets judged against, where the code and the database live, the rules every task has to follow. Then each task says what to do, with a little (ref: D5) breadcrumb back to the decision that justified it. The breadcrumb is for me, not for whoever executes the task. They are told to ignore it.
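Since the `(ref: D#)` breadcrumbs exist for traceability, one thing a script could do is confirm that every breadcrumb points at a decision that actually exists in PLAN.md. The Python sketch below is a hypothetical check, not part of the setup described above. It assumes decisions appear as `### D5: title` headings, and it also accepts a comma-separated form such as `(ref: D7, D9)`, which is an assumption rather than something shown here.

```python
import re

# Matches "(ref: D5)" and, as an assumed extension, "(ref: D7, D9)".
REF_TAG = re.compile(r"\(ref: (D\d+(?:, D\d+)*)\)")


def decision_refs(tasks_text):
    """Return the D# identifiers that tasks point back to, in order of appearance."""
    refs = []
    for match in REF_TAG.finditer(tasks_text):
        refs.extend(part.strip() for part in match.group(1).split(","))
    return refs


def unresolved_refs(tasks_text, plan_text):
    """Return refs in TASKS.md with no matching '### D#:' heading in PLAN.md."""
    known = set(re.findall(r"^### (D\d+):", plan_text, flags=re.MULTILINE))
    missing = set(decision_refs(tasks_text)) - known
    return sorted(missing, key=lambda ref: int(ref[1:]))
```

An empty result means every task can be traced to a recorded decision. Anything else is a task whose justification was never written down, or was written down under a different number.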

The Handoff Works, and the Plan Never Goes Quiet

When I started this article, I hadn’t yet tried the two-document plan I just described. But it worked, and it came with a bonus.

The split holds exactly like the design said it would. Opus keeps PLAN.md and orchestrates. The build goes to Sonnet in its own context. It only reads TASKS.md, which stands on its own by design, does its file work over in its own window, and hands back a summary. The main Opus thread stays whole and cached, which was the whole reason I went with subagents over clearing context in the first place: clearing context mid-conversation wrecks your prompt caching, but a subagent isn’t cleared context, it’s separate context.

While Sonnet is grinding through a task in the background, I’m not sitting there watching a progress bar. I’m still in the foreground with Opus, working the next open question. The doer builds while the planner keeps planning. Q8 gets answered and written down as D8 in the same stretch of time Sonnet spends turning D5 into actual code.

I’d been treating it as two phases in strict sequence: plan until the plan is solid, then hand it off and build. And the first half of that is still true. On a project with no code yet, you do plan first, because there’s nothing to build from until the decisions exist. What I had wrong was the second half. Once the build starts, the planning doesn’t stop. It runs alongside the build, because the build doesn’t occupy me. It occupies a subagent.

Trust but verify is still the rule and still load-bearing. A summary tells me what Sonnet meant to do, not always what it did, so Opus reads the files the doer actually wrote before anything gets integrated. I’m not skipping that, because I’ve already seen how that movie ends a few sections up.
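The reading itself is Opus’s job, but the “what did it touch” part can be checked mechanically first. Here is a minimal Python sketch under stated assumptions: you snapshot the working tree before and after the doer runs, and you have the list of files its summary claims to have changed. The function names and the report keys are hypothetical, and this complements reading the files, it does not replace it.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's relative path to a SHA-256 digest of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before, after):
    """Return paths that were added, removed, or modified between snapshots."""
    return sorted(
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )


def check_summary(claimed, before, after):
    """Compare what the doer says it touched with what actually changed."""
    actual = set(changed_files(before, after))
    claimed = set(claimed)
    return {
        "unreported": sorted(actual - claimed),   # changed but never mentioned
        "not_changed": sorted(claimed - actual),  # mentioned but left untouched
    }
```

Both lists being empty only means the summary is honest about which files moved. Whether the changes are right is still a matter of reading them.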

But the headline holds. PLAN.md is the deliverable, code is the byproduct, and now the byproduct gets built in the background while I stay in the foreground making the decisions that produce it. Three attempts to get here. The first had me building with no plan. The second buried me in a plan I couldn’t build from. The third writes the plan and builds from it at the same time, and I finally get to be the one person in the room whose only job is to think.

Written by Stephan Miller, Kansas City Software Engineer and Author
