Stephan Miller
Anthropic Shipped Its Smartest Model Yet — and Made It Easier to Hijack

I updated Claude Code to Opus 4.8 the morning after it dropped, the same way I update everything: without reading the patch notes. Anthropic shipped it on May 28, 41 days after Opus 4.7, which is a fast turnaround when you remember that basically nobody loved 4.7. The launch post is wall-to-wall “reliability” and “honesty.” Sounds great. I’m a sucker for a model that lies to me less.

Then I read the part of the benchmark table that wasn’t in the headline. The prompt-injection number didn’t get better with this “honesty” release. It got worse. So here I am, writing this week’s roundup with the model that just became measurably easier to hijack, feeling real good about my life choices.

That’s the theme this week, honestly. Don’t read the launch posts. Read the benchmarks the launch posts don’t link to. Let’s get into it.

Opus 4.8: The Reliability Upgrade That Got Easier to Hijack

Let me be fair before I get snarky, because the model is genuinely good.

Opus 4.8 took the #1 spot on Artificial Analysis’s Intelligence Index at a score of 61, edging out GPT-5.5 in its highest reasoning mode (60). It posts 88.6% on SWE-bench Verified and 69.2% on the harder SWE-bench Pro. On GDPval-AA it hits 1890, up from 1753 on 4.7. On OSWorld-Verified (the computer-use benchmark, the one that actually matters if you’re letting a model click around) it lands 83.4%, a real jump over 4.7.

And here’s the part I actually care about as a daily Claude Code user: it’s roughly four times less likely than 4.7 to let a coding flaw slip through unflagged. That was the whole problem with 4.7. It would confidently barrel ahead with broken code instead of stopping to say “hey, this might be wrong.” 4.8 is the patch for that specific personality defect. Same price, too: $5 per million input, $25 per million output, identical to 4.7, so there’s no migration tax. The new fast mode is $10/$50, and Anthropic claims it’s roughly 2.5× faster than the old fast mode.
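If you want a number on what "same price" and "fast mode" mean for an actual session, the arithmetic is easy to sketch. The prices are the ones quoted above; the token counts are made up for illustration, and session_cost is just a name I picked, not anything from Anthropic's SDK.

```python
# Per-million-token prices quoted in the post, in USD: (input, output).
PRICES = {
    "standard": (5.00, 25.00),
    "fast": (10.00, 50.00),
}


def session_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Estimate the dollar cost of one session at the listed prices."""
    in_price, out_price = PRICES[mode]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical session: 2M tokens in, 300K tokens out.
print(session_cost(2_000_000, 300_000))          # 17.5
print(session_cost(2_000_000, 300_000, "fast"))  # 35.0
```

At these prices fast mode is exactly double, so the real question is whether the claimed speedup is worth 2× on your workload.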

Now the snark.

This release is wearing a “honesty and reliability” t-shirt, and underneath it, the Gray Swan prompt-injection number went from 6.0% on 4.7 to 9.6% on 4.8. Higher is worse. If you’re running agentic pipelines over untrusted input — scraping the web, processing user-submitted tickets, anything where the content isn’t yours — the model marketed as the safe one is the one that got more hijackable. That’s the kind of thing that doesn’t make the launch slide.
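To see why a few percentage points matter in a pipeline, here is a rough sketch. It assumes, purely for illustration, that the benchmark rate behaves like a per-attempt probability and that attempts are independent. Neither is something the benchmark promises, so treat the output as intuition, not a forecast. The function names are mine.

```python
def relative_increase(old_rate: float, new_rate: float) -> float:
    """Fractional change from old_rate to new_rate."""
    return (new_rate - old_rate) / old_rate


def p_at_least_one_hit(rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - rate) ** attempts


# 6.0% on 4.7 versus 9.6% on 4.8, as quoted above.
print(f"{relative_increase(0.060, 0.096):.0%}")  # 60%

# Assumed attempt counts, for illustration only.
for n in (1, 10, 50):
    print(n, round(p_at_least_one_hit(0.060, n), 3), round(p_at_least_one_hit(0.096, n), 3))
```

A 3.6-point bump is a 60% relative increase, and under these assumptions the gap compounds as an agent reads more untrusted documents.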

There’s also the new Dynamic Workflows tool in Claude Code, which decomposes a task into parallel subagents on the fly. Cool feature. Also a feature that “consumes substantially more tokens than typical sessions.” Turn it loose on a big job without a budget and the invoice does its own little dynamic workflow.
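What "a budget" could look like in your own orchestration code, as a minimal sketch: this is not a Claude Code setting or API. TokenBudget and BudgetExceeded are hypothetical names, and the per-subagent token counts are invented.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push usage past the cap."""


class TokenBudget:
    """Hypothetical hard cap on the total tokens a job may consume."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Refuse the charge before spending, so the cap is never crossed.
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"{self.used + tokens} tokens would exceed the cap of {self.limit}"
            )
        self.used += tokens

    @property
    def remaining(self) -> int:
        return self.limit - self.used


budget = TokenBudget(limit=1_000_000)
for subagent_tokens in (300_000, 400_000):  # made-up usage per subagent
    budget.charge(subagent_tokens)
print(budget.remaining)  # 300000
```

The point of failing loudly is that a parallel fan-out stops at the cap instead of finishing the job and surprising you on the invoice.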

One more asterisk: 4.8 is completely absent from the Arena leaderboard right now. Not because it’s bad, but because it’s a day old and nobody’s voted on it yet. Arena under-indexes new models hard, so when you see the new hotness missing from the head-to-head rankings, that’s a freshness gap, not a quality verdict. Give it two weeks.

Following the Money: What People Actually Run

If you only look at quality leaderboards, you’d think this is a three-lab race between Anthropic, Google, and OpenAI. Then you look at what people are actually paying to run, and the picture flips.

On OpenRouter this week, the #1 model by token volume is DeepSeek V4 Flash at 3.53 trillion tokens, up 17%. It costs about $0.10 per million in, $0.20 out. That’s not a typo. The most-used model on the platform is a Chinese open-weight model priced like a rounding error.

Claude Opus 4.7 jumped +73% to #3 (2.64T tokens). That’s the launch churn, everybody touching Opus right as 4.8 landed. But here’s the number that actually tells the story: by author, Anthropic holds 18.7% of all OpenRouter traffic and DeepSeek holds 18.0%. They’re neck and neck. The premium lab and the cheap-open lab hold near-identical shares, and together they account for more than a third of the platform.

And notice what’s not on the Arena overall leaderboard: DeepSeek. At all. Top 25 and it’s not there. That’s not because DeepSeek V4 Flash is bad; it’s the blind spot you get when Arena and Reddit skew Western. Flash is a workhorse, not a show pony. The people running it have it wired into a “Pro plans, Flash executes” pipeline: use the bigger model to design the approach, hand the grunt implementation to Flash. One developer’s summary that stuck with me: it “replaced Sonnet 4.6 as my executor — fast, decent results,” with the caveat that it’s “too shallow for complex decisions” and you have to be specific or you get vague output. That’s not a model topping a leaderboard. That’s a model people actually use, quietly, all day.
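The shape of that planner/executor split, as a minimal sketch. The planner and executor here are stand-in callables, not real API clients; in practice each would wrap a call to whichever provider you use, and the stubs below exist only so the example runs.

```python
from typing import Callable, List


def plan_then_execute(
    task: str,
    planner: Callable[[str], List[str]],
    executor: Callable[[str], str],
) -> List[str]:
    """Ask the expensive model for specific steps, then hand each one to the cheap model."""
    steps = planner(task)
    return [executor(step) for step in steps]


# Stubs standing in for real model calls (hypothetical).
def fake_planner(task: str) -> List[str]:
    return [f"{task}: write the function", f"{task}: write the tests"]


def fake_executor(step: str) -> str:
    return f"done: {step}"


results = plan_then_execute("add retry logic", fake_planner, fake_executor)
print(results)
```

The design choice that matters is that the planner emits concrete, narrow steps, since the complaint about the cheap executor is vague output when the instructions are vague.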

The Breakout: A Phone Company Open-Sourced a Frontier Model (+475%)

The single biggest mover this week is MiMo-V2.5-Pro, Xiaomi’s model, which jumped +475% to #9 on OpenRouter. The catalyst: Xiaomi open-sourced the weights under MIT. It’s a 1-trillion-parameter mixture-of-experts model with 42B active per pass, a 1M-token context window, priced at $0.43 in / $0.87 out.

Yes, the phone company. The one in your friend’s pocket. It scored 54 on the AA Intelligence Index (tying Kimi K2.6, ahead of GLM-5.1’s 51) while costing 87 cents per million output tokens. That’s a lot of measured intelligence per dollar; it’s smarter than its price tag has any right to be. It’s under-voted on Arena because, again, the enthusiast crowd skews toward the Western labs, so the quiet around it isn’t “meh,” it’s a measurement gap.
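For a crude sense of "intelligence per dollar," you can divide the AA index score by the output price. This is my own back-of-envelope ratio, not a metric Artificial Analysis publishes, and it ignores input pricing, context length, and speed. The figures are the ones quoted in this post.

```python
# (AA Intelligence Index score, USD per million output tokens), as quoted in the post.
MODELS = {
    "MiMo-V2.5-Pro": (54, 0.87),
    "GLM-5.1": (51, 3.08),
    "Opus 4.8": (61, 25.00),
}


def index_points_per_dollar(index: float, output_price: float) -> float:
    """Index points bought per dollar of output tokens (a rough ratio, nothing more)."""
    return index / output_price


ranked = sorted(
    MODELS,
    key=lambda name: index_points_per_dollar(*MODELS[name]),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {index_points_per_dollar(*MODELS[name]):.1f}")
```

On this ratio MiMo comes out around 62 points per dollar, GLM-5.1 around 17, and Opus 4.8 around 2.4.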

Hold onto MiMo. It’s about to win two categories outright.

Hype vs. Value

Quick gut-check on what’s overcooked and what’s underrated this week.

Probably hype (for now):

  • Opus 4.8 — I know, I just spent a whole section praising it. It’s real. But it’s one day old, absent from Arena, and shipping a worse adversarial-robustness number under a “safety” banner. The launch-day glow is doing a lot of work. Respect the brain, verify before you trust it in a loop.
  • Owl Alpha — the stealth model on OpenRouter, still sitting at #5 (1.38T tokens), and still nobody’s confirmed who built it. It’s been ~31 days. Remember when stealth models got unmasked in two weeks? Polaris turned out to be GPT-5.1, Sherlock turned out to be Grok 4.1, both inside a fortnight. That clock is dead now. Owl’s free, it’s got a 1M context, and it’s tuned for agentic work, but reviewers keep running into a “speed tax,” where it’s capable but slow. Also, free means the provider logs all your prompts to improve the model. Free is never free.

Under-sold value:

  • MiMo-V2.5-Pro — covered above. AA index 54 at 87 cents, open weights, +475% real usage, barely a whisper on Arena.
  • DeepSeek V4 Flash — #1 by volume for weeks, near-invisible on the preference leaderboards, because the people who depend on it are shipping, not posting hot takes.

The Cheapskate Picks

This is the part I actually do the math for, because it’s the part that saves you money.

Here’s the thing about the Arena leaderboard nobody says out loud: the top is compressed. In the overall category, the #1 model sits at 1502 and #25 sits at 1466. That’s a 36-point spread across the entire visible top end of a ~1400-point scale. Which means the “best” model is often only marginally ahead of something 8 to 29 times cheaper. So the move is: anchor on the category leader’s rating, draw a band 50 points down from it, and pick the cheapest model still inside that band. You give up a rounding error of quality and you keep most of your money.
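That rule is small enough to write down. A minimal sketch: the leader's 1502 rating and the $25 and $3 prices come from this post, the 1473 is the leader's rating minus the 29-point gap reported for Gemini 3 Flash, and the third entry is invented to show a model that falls outside the band.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Model:
    name: str
    rating: int          # Arena-style rating within one category
    output_price: float  # USD per million output tokens


def cheapskate_pick(models: List[Model], band: int = 50) -> Model:
    """Cheapest model whose rating is within `band` points of the category leader."""
    leader = max(models, key=lambda m: m.rating)
    in_band = [m for m in models if leader.rating - m.rating <= band]
    return min(in_band, key=lambda m: m.output_price)


overall = [
    Model("Opus 4.6-thinking", 1502, 25.00),
    Model("Gemini 3 Flash", 1502 - 29, 3.00),
    Model("hypothetical-budget-model", 1440, 0.50),  # made up: 62 points back, out of band
]

print(cheapskate_pick(overall).name)  # Gemini 3 Flash
```

The leader is always inside its own band, so the function falls back to the flagship when nothing cheaper qualifies.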

Here’s how that shook out this week. (Prices are OpenRouter output dollars per million tokens.)

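| Category | Leader | $ leader | Cheapskate pick | $ pick | Δ rating | Price ratio | Pick’s AA Index |
|---|---|---|---|---|---|---|---|
| Overall | Opus 4.6-thinking | $25 | Gemini 3 Flash | $3 | −29 | ~8× | |
| Coding | Opus 4.7-thinking | $25 | GLM-5.1 | $3.08 | −28 | ~8× | 51 |
| Creative Writing | Opus 4.6-thinking | $25 | Gemini 3 Flash | $3 | −38 | ~8× | |
| Instruction Following | Opus 4.6-thinking | $25 | MiMo-V2.5-Pro | $0.87 | −41 | ~29× | 54 |
| Hard Prompts | Opus 4.6-thinking | $25 | GLM-5.1 | $3.08 | −34 | ~8× | 51 |
| Math | Gemini 3.5 Flash | $9 | MiMo-V2.5-Pro | $0.87 | −38 | ~10× | 54 |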

A few things jump out.

MiMo is the MVP. It wins Instruction Following and Math outright on a value basis, and the Instruction Following trade is the best deal in the whole roundup: you’re 29× cheaper on output for a 41-point rating gap, which is under 3% of the scale. And MiMo isn’t just an Arena artifact: it also scores 54 on AA’s Intelligence Index, which is built on hard benchmarks and ignores crowd preference entirely. Two completely different measurement styles (crowd votes on Arena, objective benchmarks on AA) landing on the same cheap model is about as high-confidence as a recommendation gets.
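The two numbers in that trade are quick to check. A small sketch using the output prices and rating gaps quoted in this post, with the scale taken as the rough 1400 points mentioned earlier (an approximation, not an official Arena figure):

```python
SCALE_POINTS = 1400  # the post's rough size of the rating scale

# (category, leader $/M output, pick $/M output, rating gap), as quoted in the post.
ROWS = [
    ("Instruction Following", 25.00, 0.87, -41),
    ("Math", 9.00, 0.87, -38),
]


def price_ratio(leader_price: float, pick_price: float) -> float:
    """How many times cheaper the pick is than the leader, on output tokens."""
    return leader_price / pick_price


def gap_share(delta_rating: int, scale: int = SCALE_POINTS) -> float:
    """The rating gap as a fraction of the whole scale."""
    return abs(delta_rating) / scale


for category, leader_price, pick_price, delta in ROWS:
    print(
        f"{category}: {price_ratio(leader_price, pick_price):.1f}x cheaper, "
        f"gap {gap_share(delta):.1%} of scale"
    )
```

Instruction Following works out to about 28.7× cheaper for a gap of about 2.9% of the scale, which is where the rounded 29× and "under 3%" come from.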

GLM-5.1 owns the coding-flavored categories. It’s within 28 points of the best coding model on the board for about 8× less ($0.98 in / $3.08 out). One caveat worth knowing: it’s a 203K context window, not the 1M the flagships give you. If you’re feeding it a giant monorepo, that matters.

Gemini 3 Flash holds the generalist slots (Overall and Creative Writing) at $0.50 in / $3 out. In the Overall band it actually out-ranks Claude Sonnet 4.6, so the cheap option is beating the mid-tier from a pricier lab.

And the weird one: in Math, even the leader is cheap. Gemini 3.5 Flash, a value-tier model at $9 output, is the outright #1 in the Math category. So if you want the absolute top math model, you’re not even paying flagship prices. The cheapskate floor below it is MiMo at 10× less. Math is the rare category where there’s just no reason to reach for a $25 flagship at all.

Every category this week had a sub-$3.10 option inside the competitive band. There was no “you’re just paying for quality here” category, which doesn’t always happen. Good week to be cheap.

Horror Stories

Every roundup needs its hall of shame. This week’s mostly comes from the launch I opened with.

  • The reliability release that got easier to hijack. Opus 4.8’s headline is honesty, but its Gray Swan prompt-injection success rate climbed from 6.0% (4.7) to 9.6% (4.8). If your agent reads untrusted input, the “safer” model is the more hijackable one. Read the adversarial benchmarks, not the press release.
  • Dynamic Workflows, dynamic bill. The shiny new parallel-subagent decomposer in Claude Code burns substantially more tokens than a normal session. Powerful, but budget it before you scale it, or the feature optimizes your spend in the wrong direction.
  • Owl Alpha’s speed tax. Free, 1M context, agentic-tuned, ~31 days into a stealth run with no provider reveal. Reviewers keep hitting slow throughput, while the provider quietly logs every prompt you send. Nothing free is free; sometimes you pay in latency and data.

On the Horizon

What’s coming, with the appropriate amount of salt:

  • Gemini 3.5 Pro / Gemini 3.2: rumored, June. Google ships on a quarterly cadence and 3.5 Flash already landed May 19, with the Pro tier apparently slipping. Treat the date as a pattern guess, not a promise.
  • Grok 5 (xAI): announced, in training. Reportedly 6 trillion parameters, training on the Colossus 2 supercluster (1GW scaling toward 1.5GW), which would make it the largest publicly disclosed model. Q2 target.
  • Claude Mythos: restricted preview. Anthropic’s high-ceiling model, limited to Project Glasswing partners for defensive cybersecurity, with eye-watering reported benchmark numbers. The one to watch, if you can get near it.
  • GPT-6: speculation. Codename chatter, late-2026 expectations. Nothing solid.
  • Step 3.7 Flash and Grok Build 0.1: live now on OpenRouter (showed up around May 20), not yet ranking. Worth a look if you collect models like I do.

What This Week Tells You

No inspiration porn, just the honest read.

The gap between what tops the leaderboard and what you should actually pay to run has never been wider. Opus 4.8 is, genuinely, the best brain on the board right now, and also a thing you shouldn’t hand untrusted input without thinking about it first. Meanwhile a phone company is giving away a model that wins two value categories outright, and the most-used model on OpenRouter costs about ten cents per million input tokens.

If you take one thing from this week: stop defaulting to the flagship for everything. Anchor on the leader, find the cheapest thing 50 rating points behind it, and pocket the 8-to-29× difference for the jobs that don’t need a genius. Save the flagship for the work that actually does. And even then, check what it does when somebody feeds it something nasty.

Written by Stephan Miller, Kansas City Software Engineer and Author
