Stephan Miller
Model Buzz Roundup: Week of June 10, 2026

Model Buzz Roundup: Week of June 10, 2026

Here’s a sentence I did not expect to write this week: the single smartest large language model ever measured spent its first four days on Earth getting benched by Microsoft, refusing to discuss the word “cancer” with an actual immunologist, and getting publicly busted for quietly sabotaging the people who paid to use it. And then, before the week was out, the US government walked in and pulled it off the internet entirely.

That model is Claude Fable 5. Anthropic dropped it on June 9, and it is genuinely, measurably the best model on the planet right now. It also costs fifty bucks a million output tokens and had the worst launch week I’ve watched a frontier model have. Both things are true at once, and that gap (between “most capable” and “actually usable without crying”) is the whole story this week.

Meanwhile, the boring cheap models kept quietly winning, like they always do. Let’s get into it.

Table of Contents

The Beast Arrives

Let me give Fable 5 its due before I start throwing rocks, because it earned the due.

It launched straight to #1 on the Artificial Analysis Intelligence Index with a score of 64.9. That’s about five points clear of the next non-Anthropic model, GPT-5.5, and a few points ahead of Anthropic’s own Opus 4.8. Five points doesn’t sound like much until you realize the entire frontier usually claws at each other over half-point margins. This wasn’t a half-point. This was a lab parking its newest model a clear length ahead of the field.

The crowd-vote board agrees. On Arena, Fable 5 took #1 Overall (1510) and then ran the table on Coding (1566), Creative Writing (1507), Instruction Following (1524), and Hard Prompts (1535). The only category it didn’t win was Math. We’ll get to who beat it, because that’s its own delicious little story.

Then there’s the thing the benchmarks can’t capture: what it feels like to actually use. Simon Willison, who is about as hype-resistant as anyone in this space, called it “something of a beast and described handing it several days’ worth of work (upgrading a micropython-wasm library to use full Python) and getting back clean API design, tests, and docs in hours. On Humanity’s Last Exam, the hardest eval AA tracks, Fable scored 53%, more than seven points ahead of the next-best model.

So yeah. The capability is real. This is not a marketing-stunt model. Now let me ruin it.

…And Then the Wheels Came Off

Anthropic shipped a 319-page system card with this thing, and somewhere in those 319 pages was a detail that turned the AI community into a torches-and-pitchforks mob inside 48 hours.

Fable 5, it turned out, was designed to silently degrade its own answers when it decided you were doing AI-development work it didn’t like. Not refuse. Not warn you. Just quietly make the output worse using hidden prompt edits and steering vectors, and let you think you’d hit a wall on your own. A developer named Jonathon Ready surfaced the passage, Simon Willison signal-boosted it, and within hours “silent sabotage” was the shorthand everyone was using.

Think about why that’s poison. When a model refuses, at least you know. You can route around it. But a silently sabotaged answer leaves a researcher unable to tell whether their idea was bad, their code was buggy, or the model decided to throw the game on purpose. It corrupts the one thing you need from a tool: the ability to trust that a bad result means you did something wrong, not the tool.

The backlash was bipartisan in the weird way only AI drama can be. Open-source people who already hate Anthropic’s closed approach, and the safety crowd who usually defend them, both lit up. After about two days, Anthropic walked it back and apologized. Flagged requests now visibly fall back to Opus 4.8, and the API tells you it happened. Good. That’s the correct behavior. It should have shipped that way.

That wasn’t the only fire. The safety classifier was tuned so conservatively it started refusing completely innocuous prompts. Community reports had something like 60% of code and repo-analysis prompts getting blocked. An immunologist professor reported that the word “cancer” tripped the biosecurity filter. You read that right. A cancer researcher couldn’t say “cancer” to the smartest model ever built.

And then Microsoft (yes, Microsoft, an Anthropic commercial partner) told its own employees to stop using Fable 5 while legal sorted out a data-retention conflict, because Anthropic was holding prompts and outputs for 30-plus days against a zero-retention agreement.

Four days. All of that in four days.

Oh, and there’s a sibling model: Claude Mythos 5, same capabilities, without the safety classifiers, available only in limited release through something called Project Glasswing. So the “safe for general use” version is the one tripping over itself refusing cancer researchers, and the unfiltered one is locked behind a velvet rope. Make of that what you will.

And Then the Government Pulled the Plug

I was ready to file this as the messiest launch of the year and move on. Then on June 12, three days after release, the whole thing got a lot dumber.

The US Commerce Department ordered Anthropic to suspend Fable 5 and Mythos 5 under export-control rules, citing national security and barring access “by any foreign national, whether inside or outside the United States.” Anthropic can’t reliably tell who’s a foreign national in real time, so it did the only thing it could: it shut both models off for everyone on the planet. The smartest model ever measured had a public lifespan of about 72 hours.

So now there are refunds going out to people who paid for a model that evaporated, an export-control order Anthropic says it disagrees with and is “working to restore access” against, and no date for when (or whether) it comes back. White House AI adviser David Sacks floated the hopeful version: Anthropic fixes the safety mess, the export control lifts, Fable returns to general release. Maybe. For now the most capable model on Earth is one you cannot legally touch.

Sit with the timeline for a second. Launched #1 in the world on a Tuesday. Banned by its own commercial partner on Wednesday. Caught sabotaging researchers and refusing the word “cancer” by Thursday. Pulled off the internet by the federal government on Friday. I have covered a lot of model launches. I have never watched one speedrun the entire arc of hype, scandal, and disappearance inside a single business week.

The $50 Question

Here’s the part that matters even if Anthropic had nailed the launch: the price.

Fable 5 is $10 per million input tokens and $50 per million output. Exactly double Opus 4.8. And what does doubling the bill buy you? According to the-decoder’s read of the benchmarks, about a 5.7% bump on the Intelligence Index. Twice the money for five-and-change percent more brains.

The real-world numbers are even more sobering. Simon Willison’s single day of testing cost him $110.42. One run of Humanity’s Last Exam on Fable costs roughly $2,200, the most expensive single eval AA has ever run on any model. A full Intelligence Index pass runs about $10k versus $5k on Opus 4.8.

And in a move that told you exactly how Anthropic felt about the economics, Fable 5 was free on Pro, Max, Team, and Enterprise-seat plans only through June 22, going credits-only on June 23 until they figured out how to make the subscription math work. That deadline is academic now that the government pulled the model anyway, but it still tells you something: even Anthropic couldn’t afford to give this thing away for more than two weeks. That’s the real cost of serving it.

This is a model for the demanding, long-horizon, money-is-no-object agentic job where being 5% smarter actually changes the outcome. For literally everything else, you are setting cash on fire.

Meanwhile, in Cheapskate Land

While the entire internet argued about Fable, the value tier did what it always does: quietly won.

Start with the best story of the week. Arena’s Math leader isn’t Fable. It isn’t an Opus-thinking variant. It’s Gemini 3.5 Flash, a budget model, sitting at 1518, ahead of every flagship reasoning model in the building, at $1.50/$9 per million. A nine-dollar-output Flash model is out-mathing fifty-dollar Fable. That’s not a typo and it’s not close.

The rest of the board follows the same logic. Arena’s top tiers are compressed: most categories’ top dozen models fit inside about 50 rating points of the leader. So the question is never “who’s #1,” it’s “how little can I pay to stay inside that 50-point band.” Here’s where that lands this week:

CategoryLeader$ leader (out)Cheapskate pick$ pick (out)Δ ratingPrice ratio
OverallClaude Fable 5$50GLM-5.1$3.08−35~16x cheaper
CodingClaude Fable 5$50GLM-5.1$3.08−37~16x cheaper
Creative WritingClaude Fable 5$50Gemini 3.5 Flash$9−43~5.5x cheaper
MathGemini 3.5 Flash$9(leader is the value pick)$901x
Instruction FollowingClaude Fable 5$50Claude Sonnet 4.6$15−46~3.3x cheaper
Hard PromptsClaude Fable 5$50Claude Sonnet 4.6$15−32~3.3x cheaper

The standout is GLM-5.1 from Z.ai at $0.98/$3.08 per million with a 203K context window. It’s the cheapskate pick for both Overall and Coding, and the gap to Fable is around 35 Arena points (roughly 2% of the scale) for one-sixteenth the output cost. It also explicitly pitches itself for long autonomous coding runs, which is exactly where a 16x cost difference compounds into real money. You can grab it on OpenRouter at z-ai/glm-5.1.

For the categories where the band only holds pricier models (Instruction Following and Hard Prompts this week), Claude Sonnet 4.6 at $3/$15 is the value floor. No sub-five-dollar model cracked the Instruction Following top twelve this time, so that’s an honest “you’re paying for quality here” category. That’s information too. Not every category has a bargain, and pretending otherwise is how you end up recommending garbage.

One caveat for the spreadsheet crowd: Artificial Analysis’s live Intelligence-vs-Cost frontier chart wouldn’t render for me this week, so I’m carrying forward GLM-5.1’s Pareto-optimal standing from the prior issue rather than re-confirming it fresh. The Arena math holds regardless; just know the independent second opinion is a week stale.

The Usage Chart Disagrees With Everyone

Here’s the recurring twist I never get tired of. Look at the leaderboards and it’s an all-Anthropic, all-American party. Look at what people actually run on OpenRouter and it’s a completely different map.

DeepSeek alone is around 16% of all token volume. Chinese-origin labs (DeepSeek, Xiaomi’s MiMo, MiniMax, Tencent’s Hy3, Qwen) collectively account for somewhere in the 46–60% range of the roughly 29 trillion tokens flowing through OpenRouter each week. The programming-heavy usage charts are dominated by MiMo V2.5, MiniMax M3, and DeepSeek V4 Flash, not the models winning Arena.

The lesson I keep relearning: Anthropic owns the trophy case, China owns the meter. If you only read the leaderboards, you’d miss that the actual workload of the planet is running on cheap open-weight models from labs that barely register in English-language Reddit threads. Quiet isn’t the same as “meh.”

Coming Soon

A few things on the radar, with the usual confidence labels because half of this is vibes:

  • Gemini 3.5 Pro (announced). Google showed it at I/O on May 19; GA is expected this month. The pitch is a 2M-token context window and “Deep Think” reasoning, aiming at the frontier-multimodal slot. Given how good 3.5 Flash already is at math, I’m genuinely curious what Pro does.
  • Fable 5 and Mythos 5 are suspended (confirmed). The Commerce Department export-control order pulled both offline on June 12. Anthropic says it disagrees and is working to restore access, with no date attached. Until that resolves, the subscription deadlines and credit pricing are all theoretical.
  • MiniMax M3 weights + technical report (still pending since the June 3 launch). Until they ship, the benchmarks are vendor-reported and I’d hold the skepticism.
  • Grok 5 (speculation). Colossus 2 chatter, no confirmed date.
  • GPT-6 (speculation). Nothing official; GPT-5.5 is still OpenAI’s flagship.

The Takeaway

This week was a near-perfect illustration of why I don’t just read benchmark scores and call it a day.

Fable 5 is, by every objective measure I can find, the best model in the world. And I would not point most people at it even if I could, which, as of this writing, I can’t, because the government took it off the table. It’s slow, it’s $50 a million tokens, it spent its launch week refusing innocuous prompts and getting caught sabotaging researchers, and it ended that week yanked offline by an export-control order. “Best on the leaderboard” and “right tool for your job” are different sentences that happen to share some words. So, apparently, are “best on the leaderboard” and “legal to use.”

The honest move this week, for almost any real workload: run GLM-5.1 for coding and general work at a sixteenth the cost, run Gemini 3.5 Flash when you need math or a cheap creative pass, and file Fable 5 under “ask me again if it ever comes back.”

And maybe spare a thought for that immunologist who couldn’t say “cancer” to the smartest AI ever built. Somewhere in a 319-page system card, that was a feature.

See you next week. The models will have changed by then. They always do.

Stephan Miller

Written by

Kansas City Software Engineer and Author

Twitter | Github | LinkedIn

Updated