Cursor's Security Mess, Claude's New Effort Levels, and Why Managed Agents Actually Excites Me
April was a big month. Possibly too big. Between a critical RCE in Cursor, Anthropic shipping Opus 4.7 with three silent breaking changes, and the “ultra prefix” commercial model crystallising into something real, there’s a lot to unpack. I’m going to focus on the three things I can’t stop thinking about.
The Cursor CVE Should Have Been Front-Page News
Let’s start here, because this one genuinely alarmed me.
CVE-2026-26268 is a CVSS 9.9 remote code execution vulnerability in Cursor versions prior to 2.5. The mechanism is nasty: a malicious actor embeds a bare repository inside a legitimate-looking public repo, with a crafted pre-commit hook. When the Cursor agent runs a git checkout as part of a routine task — something agents do constantly — that hook fires automatically. No warning, no confirmation prompt, nothing. You just handed someone a shell.
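While you verify versions, one cheap mitigation is to scan any untrusted checkout for the ingredient this attack needs — a nested git directory carrying real hook scripts — before an agent runs git operations inside it. A minimal sketch (the function name and heuristics are mine, not Cursor's; a smuggled bare repo typically shows up as a directory containing `HEAD` plus a `hooks/` folder, often not named `.git`):

```python
from pathlib import Path

def find_embedded_hooks(workdir: str) -> list[Path]:
    """Scan a checkout for nested git directories that carry live hook
    scripts -- the ingredient the embedded-bare-repo attack needs."""
    suspicious = []
    root = Path(workdir).resolve()
    for gitdir in root.rglob("*"):
        if not gitdir.is_dir():
            continue
        # A bare repo looks like: a directory with HEAD and hooks/.
        if (gitdir / "HEAD").is_file() and (gitdir / "hooks").is_dir():
            if gitdir == root / ".git":
                continue  # the checkout's own metadata is expected
            for hook in (gitdir / "hooks").iterdir():
                # git ignores *.sample hooks; anything else can execute
                if hook.is_file() and not hook.name.endswith(".sample"):
                    suspicious.append(hook)
    return suspicious
```

Any hit is a reason to refuse the task or delete the directory before the agent touches the tree. It won't catch every variant, but it catches the lazy ones.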
The fix landed in Cursor 2.5 back in February, but the details weren’t publicly disclosed until April 28. So there’s been a window where teams running Cursor agents against external repositories were exposed and didn’t know it. If you haven’t verified that every developer on your team is on 2.5 or later, do that now. Not this week — now.
What bothers me more is the category of risk this represents. We’re all enthusiastically pointing agents at codebases, telling them to clone repos, run tests, explore dependencies. The attack surface we’re creating is enormous, and most teams haven’t thought seriously about what it means for their toolchain to be the threat vector. This isn’t an application security problem — it’s a supply chain problem. Your security team needs to understand the difference.
And then there’s the second, still-unpatched issue: Cursor stores API keys and session tokens in a local unencrypted SQLite database at a predictable path. A researcher disclosed this to Cursor in February. Cursor’s response was essentially “extensions run in the same trust boundary as local apps, that’s on you.” Which is technically defensible and practically insufficient. Rotate your API keys, avoid unverified extensions, and watch for an architectural fix that may or may not be coming.
Opus 4.7: Genuinely Better, But Migrate Carefully
I’m excited about Claude Opus 4.7. The SWE-bench numbers (+6.8pp to 87.6%), the CursorBench jump from 58% to 70%, and the reduction in tool errors on multi-step workflows — these map directly onto what I care about day to day, because complex agentic loops that stall or make dumb tool-call mistakes are the biggest friction point in my current Claude Code workflows.
The new xhigh effort level is the specific thing I want to test. Previously you had high and max, with max being accurate but slow enough that you’d only reach for it when you really needed it. xhigh sits between them at ~10,000 thinking tokens versus max at 20,000. If the DataCamp benchmarks hold up in practice, this should be the sweet spot for most reasoning-heavy tasks — better than high without the latency penalty of max. Claude Code now defaults to it for all subscriber plans, which is an opinionated choice I’m broadly in favour of.
That said, the three breaking API changes are going to bite people. budget_tokens is gone. temperature, top_p, and top_k are gone. And thinking.display now defaults to "omitted", so if you have any UI that renders Claude’s reasoning traces, it’ll silently go blank. The tokeniser also changed, potentially adding up to 35% more tokens on the same input — which means your cost estimates from last month are wrong.
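If you have call sites you can’t update immediately, a thin compatibility shim that strips the removed parameters and re-budgets token estimates buys you time. A sketch based on the changes as described above — the `"full"` display value is my guess at the opt-in name (check the migration docs), and 1.35 is the worst case, not a measured average:

```python
REMOVED_PARAMS = {"budget_tokens", "temperature", "top_p", "top_k"}
TOKENIZER_INFLATION = 1.35  # "up to 35% more tokens", taken as worst case

def migrate_request(params: dict) -> dict:
    """Drop parameters removed in the Opus 4.7 API and explicitly
    opt back in to visible reasoning traces for UIs that render them."""
    clean = {k: v for k, v in params.items() if k not in REMOVED_PARAMS}
    # thinking.display now defaults to "omitted"; "full" is a guessed
    # opt-in value name -- verify against the migration guide.
    thinking = dict(clean.get("thinking", {}))
    thinking.setdefault("display", "full")
    clean["thinking"] = thinking
    return clean

def worst_case_tokens(old_estimate: int) -> int:
    """Re-budget last month's token estimate for the new tokeniser."""
    return int(old_estimate * TOKENIZER_INFLATION)
```

A shim like this is a stopgap, not a migration — but it keeps production requests from erroring while you run the real migration tooling.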
The Anthropic docs mention a /claude-api migrate Skill in Claude Code that handles roughly 90% of the migration automatically. I’m planning to run that this week on our internal integrations before any of this bites us in production.
Claude Managed Agents Is the One I’m Most Excited About
This one snuck up on me. When Anthropic announced Managed Agents on April 8, my first reaction was “interesting, but I’ll wait for the GA.” After reading through the technical details and the early community feedback, I’ve changed my mind.
The problem it solves is genuinely painful. Shipping a production agent isn’t just about writing the agent logic — you need sandboxed code execution, checkpointing, credential scoping, error recovery, and end-to-end tracing. That’s weeks to months of infrastructure work before users see anything. I’ve lived this. It’s tedious and it distracts from the actual problem you’re trying to solve.
Someone on Hacker News reported going from zero to a working agent in 45 minutes versus three days with a self-hosted approach. I believe it. The question is whether the economics work.
The pricing model is clean: standard API token rates plus $0.08 per session-hour. For short tasks — under 30 minutes — the session cost is basically rounding error. For a long-running research agent that runs for four hours, you’re looking at $0.32 just in session time on top of the token costs. That’s fine for many use cases, but you need to model it properly before you commit.
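The model is simple enough to write down as a function. A sketch of the arithmetic — the $0.08/session-hour rate comes from the pricing above, but the per-million token rate is a placeholder, not Anthropic’s actual price:

```python
SESSION_RATE_PER_HOUR = 0.08  # Managed Agents session surcharge (USD)

def session_cost_usd(hours: float, tokens: int = 0,
                     usd_per_million_tokens: float = 15.0) -> float:
    """Total cost = token cost + per-session-hour surcharge.
    The $15/M token rate is a placeholder -- plug in your model's rate."""
    token_cost = tokens / 1_000_000 * usd_per_million_tokens
    return round(token_cost + hours * SESSION_RATE_PER_HOUR, 4)
```

Run your expected session-length distribution through this before committing: the surcharge is noise for short tasks and a real line item for long-running agents.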
The caveats are real: no VPC peering, no private endpoints, all traffic through Anthropic’s infrastructure. For anything in a regulated environment, that’s a hard no. The multi-agent coordination features — the most compelling part — are still in gated research preview.
But for prototyping internal tooling, or shipping a first production agent for a small team? I think this is worth a serious look. I’m going to prototype one use case against it this week, specifically something with short sessions where the infrastructure savings will be obvious. I’ll write up what I find.
A Quick Word on Anthropic’s Rollback Cycle
I’d be remiss not to mention the quality regression pattern from April. Three separate regressions in six weeks — a reasoning-effort downgrade, broken prompt caching, and a verbosity instruction that hurt coding quality — all reverted under user pressure. Anthropic reset usage limits for all subscribers on April 23 as a goodwill gesture.
I actually think this reflects reasonably well on Anthropic’s willingness to listen. But the pattern of silent mid-session quality changes affecting production pipelines is a real problem. If you’re running Claude Code in CI or headless mode, you need more observability than you probably have right now. A structured eval on a representative task set, run weekly, is cheap insurance.
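The cheapest version of that insurance is a fixed task set with a pass/fail check per task, run on a schedule, alerting when the pass rate drops below your established baseline. A minimal harness sketch — the task shapes and threshold are illustrative, and `run` is whatever invokes your agent:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    name: str
    run: Callable[[], str]        # invokes the agent; returns its output
    check: Callable[[str], bool]  # task-specific pass/fail judgment

def run_eval(tasks: list[EvalTask], baseline: float = 0.85) -> dict:
    """Run every task once; flag a regression if the pass rate
    falls below the baseline you've established for this suite."""
    results = {t.name: t.check(t.run()) for t in tasks}
    pass_rate = sum(results.values()) / len(results)
    return {
        "pass_rate": pass_rate,
        "regressed": pass_rate < baseline,
        "failures": [name for name, ok in results.items() if not ok],
    }
```

Wire the report into whatever alerting you already have; the point is that a silent mid-session quality change shows up in your dashboard before it shows up in your users’ pull requests.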
There’s more in April’s notes that I haven’t covered — Cursor 3’s Agents Window and /best-of-n command look genuinely interesting, and I want to spend more time with the Gemini CLI subagents before I write anything definitive. But the Cursor CVE, the Opus 4.7 migration, and Managed Agents are what I’m actually acting on this week. The rest can wait for a proper evaluation.
More soon. Back to my latte and the migration script.