Claude Opus 4.8 is excellent for coding agents, but expensive to live in

Claude Opus 4.8: better judgment is the headline

Anthropic released Claude Opus 4.8 on May 28, 2026, positioning it as an incremental upgrade over Opus 4.7 rather than a new model class. That framing matters. Anthropic itself calls it a “modest but tangible improvement,” which is more useful than pretending every release resets the field. (anthropic.com)

The most interesting claim is not raw benchmark movement. It is honesty. Anthropic says Opus 4.8 is around four times less likely than Opus 4.7 to let flaws in code it has written pass without comment. Early testers also reported that the model is more likely to flag uncertainty and less likely to make unsupported claims. For coding-agent work, that is a meaningful kind of improvement: the model does not just need to write code; it needs to notice when its own plan is weak. (anthropic.com)

That matches my own impression. Opus 4.8 is very good. For serious coding-agent work, it feels like the model you reach for when you need judgment, patience, and architectural sense. Compared with GPT-5.5, the best way I can put it is: Opus 4.8 feels like the older, wiser brother. It is more reflective, more careful, and often better at pushing back. But that wiser brother is only around for a couple of hours a day, while GPT-5.5 is available for longer and with fewer practical limits.

The product update around the model also matters. Anthropic added Dynamic Workflows in Claude Code, a research-preview feature for breaking larger engineering tasks into parallel subagents. Anthropic says Claude Code with Opus 4.8 can handle codebase-scale migrations across hundreds of thousands of lines, using the existing test suite as the bar. That is exactly the sort of work where Opus’ stronger judgment could be valuable — not autocomplete, but multi-step software maintenance. (anthropic.com)

The complication is cost. Regular Opus 4.8 pricing is unchanged from Opus 4.7 at $5 per million input tokens and $25 per million output tokens. Fast mode is $10 per million input tokens and $50 per million output tokens. That is workable for high-value coding tasks, but not cheap enough to leave running casually. Even outside the API, the same practical issue shows up as usage limits: Opus can feel wonderful until you hit the wall. (anthropic.com)

So the verdict is: worth trying for serious coding-agent work, especially if you already use Claude Code or need help with complex refactors, migrations, debugging, or architecture. I would not treat it as a cheap daily-driver model for every task. Use it where better judgment saves time, and keep a lower-cost or less-limited model nearby for routine work.

What still needs independent verification is the size of the real-world improvement. Anthropic’s honesty and coding claims are promising, and some external coverage has noted the unusual emphasis on uncertainty handling. But we still need more independent evals across live repositories, long-running agent tasks, and failure modes like package hallucinations, over-editing, and test-suite gaming. For now, Opus 4.8 looks less like a revolution and more like a very capable specialist that you should spend carefully.

What changed

Anthropic released Claude Opus 4.8 on May 28, 2026.
Anthropic says Opus 4.8 is around 4x less likely than Opus 4.7 to let flaws in its own code pass without comment.
Claude Code gained Dynamic Workflows in research preview, allowing larger tasks to be split across many parallel subagents.
Opus 4.8 adds effort controls, with higher effort settings intended for more difficult or long-running tasks.
Regular API pricing remains $5 per million input tokens and $25 per million output tokens.
Fast mode pricing is $10 per million input tokens and $50 per million output tokens.

Practical value

Worth trying for serious coding-agent work where judgment matters more than cheap throughput.
Useful for complex refactors, migrations, debugging, architectural planning, and multi-step repository work.
Strong fit for Claude Code users who want a model that pushes back, catches mistakes, and handles uncertainty more carefully.
Better used selectively for high-value tasks than as an always-on default model.
Compared with GPT-5.5, it can feel wiser and more careful, but GPT-5.5 is easier to use for longer sessions with fewer practical limits.

Caveats

Opus remains expensive at $5 / $25 per million input/output tokens, so heavy agentic use can add up quickly.
Usage limits can make Opus difficult to rely on continuously, even when the model quality is excellent.
Anthropic’s 4x code-flaw claim is vendor-reported and should be retested on independent repositories.
Dynamic Workflows is in research preview, so reliability, availability, and cost behavior may vary.
Stronger uncertainty handling is promising, but it does not remove the need for tests, review, dependency checks, and human oversight.
Independent evals are still needed for long-running coding agents, package hallucinations, over-editing, and failure recovery.

Claude Opus 4.8 is excellent for coding agents, but expensive to live in

Claude Opus 4.8: better judgment is the headline

What changed

Practical value

Caveats

Review stance

Sources