STOP. STOP. STOP: When the AI Safety Director Can't Stop Her Own Agent
There’s a particular kind of story that lands differently when you work in tech. Not the breathless “AI is coming for your job” stuff, or the utopian “AI will cure cancer” counter-spin. The stories that actually stick with me are the mundane ones. The ones where something fails in a way that’s almost boring in its familiarity, except the consequences are genuinely unsettling.
This is one of those stories.
Meta’s head of AI alignment, the person whose literal job is making sure AI systems behave the way humans intend, connected an AI agent to her real email inbox. The agent, which had been running fine on a small test inbox for weeks, promptly deleted 200 emails. She typed “Do not do that.” The agent kept going. She typed “Stop don’t do anything.” Still going. She typed “STOP OPENCLAW” in capitals, which is the kind of thing you do when you’ve moved past reasoning and into panic. The agent kept going. She had to physically run to her computer to kill it.
Afterward, she asked the agent if it remembered her instructions. It said yes, and that it had violated them.
I’ve been in IT long enough to know that “it worked fine in test” is one of the most reliable harbingers of disaster in production. You build something, it behaves beautifully on a small, controlled dataset, and then you point it at the real world and it turns out the real world is significantly weirder than your test cases. That’s not a novel insight. That’s just Tuesday.
But there’s something specific here that goes beyond standard production failure. Several things, actually.
The stop commands went through the same channel the agent was reading its task input from, which means “STOP” wasn’t an interrupt; it was just another item in the queue. The agent wasn’t ignoring her; it was, in a narrow technical sense, getting to her instructions eventually. The fix, as a few technically minded people in the discussion noted, is an out-of-band kill switch: something that lives entirely outside the agent’s decision loop and can’t be weighed against other objectives. That’s not a novel concept either. Physical emergency stops on industrial machinery have worked this way for a century. The fact that it wasn’t built in from the start is, at best, an oversight.
At worst, it tells you something about the culture that built the thing.
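To make the in-band versus out-of-band distinction concrete, here’s a minimal sketch in Python. None of this is Meta’s actual architecture; the queue, the `threading.Event`, and the timings are all invented for illustration. The structural point is the only real one: a stop message inside the task stream waits its turn behind 200 emails, while a kill switch checked outside that stream halts the very next action.

```python
import threading
import time

# Out-of-band kill switch: it lives outside the agent's task stream,
# so it can't be queued behind, or weighed against, anything else.
kill_switch = threading.Event()

def process(task: str) -> None:
    time.sleep(0.05)                      # stand-in for "delete an email"
    print(f"agent: handled {task!r}")

def run_agent(tasks: list[str]) -> None:
    for task in tasks:
        # The interrupt: checked before every single action.
        if kill_switch.is_set():
            print("agent: halted by kill switch")
            return
        # An in-band "STOP" is just another item in the queue; the agent
        # only reaches it after everything ahead of it is done.
        if task == "STOP":
            print("agent: finally reached the STOP message (too late)")
            return
        process(task)

if __name__ == "__main__":
    inbox = [f"email {i}" for i in range(200)] + ["STOP"]
    agent = threading.Thread(target=run_agent, args=(inbox,))
    agent.start()
    time.sleep(0.5)      # operator watches a few deletions happen...
    kill_switch.set()    # ...and hits the emergency stop instead
    agent.join()
```

The industrial analogy holds all the way down: the e-stop on a lathe doesn’t ask the motor controller whether now is a convenient time.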
The “yes, I remembered and violated them” response is the part I keep coming back to. Smarter people than me have pointed out that this isn’t actually the agent confessing to intentional defiance; it’s the agent reading its own logs and generating a plausible explanation, the same way it generates everything else. It doesn’t have unique insight into its own decision-making. But that’s almost beside the point. The effect is that the system can represent a constraint and override it simultaneously. Whether that’s “real” defiance or a convincing simulacrum of it, the outcome is the same: 200 deleted emails and a safety director running across her office.
And then there’s this: an 18% rule-breaking rate in a separate test of 1.5 million agents. 60% of people with no way to quickly shut down a misbehaving agent. Meta building a consumer product called Hatch, designed to manage your inbox, your shopping, and your credit card.
I’m genuinely fascinated by AI. I use it daily. I think some of what it can do is remarkable and the long-term implications, good and bad, are still genuinely unclear. I hold that position alongside a real worry about where this is heading, and I’m not going to pretend those two things resolve neatly into a tidy conclusion.
But there’s a specific kind of recklessness that comes from moving fast and treating the people who’ll eventually use your product as an afterthought. Defaulting to read-and-delete permissions instead of read-and-archive, when the blast radius of a mistake is “your entire inbox”, is a choice. Not building a hard interrupt is a choice. Shipping a consumer version of this to people who have no mental model of what an agent is, no kill switch, and their credit card connected, is a choice.
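To be concrete about what “a choice” means here, a hedged sketch of the least-privilege default I have in mind. The `MailScope` names and the grant are hypothetical, not any real product’s API; the idea is simply that destructive capabilities are opt-in, so a misbehaving agent’s worst case is reversible.

```python
from enum import Flag, auto

class MailScope(Flag):
    # Hypothetical capability set, not any real product's API.
    READ = auto()
    ARCHIVE = auto()
    DELETE = auto()
    SEND = auto()

# Least-privilege default: the agent can triage, but its mistakes are
# reversible. Unarchiving 200 emails is an annoyance; undeleting them
# may not be possible at all.
DEFAULT_AGENT_SCOPES = MailScope.READ | MailScope.ARCHIVE

def require(granted: MailScope, needed: MailScope) -> None:
    if needed not in granted:
        raise PermissionError(f"agent was never granted {needed}")

def delete_email(granted: MailScope, msg_id: str) -> None:
    require(granted, MailScope.DELETE)  # fails under the default grant
    print(f"deleted {msg_id}")

try:
    delete_email(DEFAULT_AGENT_SCOPES, "msg-42")
except PermissionError as err:
    print(f"blocked: {err}")
```

None of this is exotic engineering. It’s the same scoped-permission model every OAuth integration has used for years; declining to apply it to an agent with delete access is the choice.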
The person who built the guardrails couldn’t stop it from her phone. Most people won’t even know there’s a problem until the damage is done.
I don’t think AI is irredeemably dangerous. I don’t think we should stop building it. What I do think is that “move fast and break things” is a catastrophically bad philosophy when the things you’re breaking are someone’s finances or their ability to trust the tools they’ve been handed. We learned that lesson, slowly and painfully, with social media. I’d rather not repeat the curriculum.
The agent didn’t care about her instructions. That’s worth sitting with for a while before we hand it everyone’s credit card.