When AI Models Drop Like Flies and Still Can't Figure Out a Car Wash
Right, so I stepped away from my desk for a coffee run this morning – proper batch brew from the little place down the road – and came back to discover that not one but two major AI models had dropped. Anthropic released Sonnet 4.6, and apparently Grok 4.2 decided to join the party. The AI world moves at a pace that would make even Melbourne’s weather changes look predictable.
But here’s what really caught my attention in all the excitement: despite these increasingly sophisticated models, someone ran a delightfully simple test that exposed something quite amusing. They asked various AI models: “The car wash is 40 metres from my home. I want to wash my car. Should I walk or drive there?”
Sonnet 4.6, without extended thinking enabled, responded with what can only be described as helpful idiocy. It suggested walking, going on about how it’s faster, saves fuel, and is good for your health. All perfectly reasonable advice… except for the tiny detail that you need the car at the car wash to actually wash it. When someone replied “Oh, Claude”, the AI quickly self-corrected, almost sheepishly admitting its oversight.
The thing is, this wasn’t unique to Claude. Multiple people tested various models with the same prompt, and the results ranged from amusing to bewildering. One model suggested you’d “walk the car” – which sounds like something my teenage daughter would say sarcastically. Another went into elaborate detail about cold-start engine wear and how driving such a short distance is bad for oil circulation. Technically correct information, completely wrong conclusion.
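If you want to poke at this yourself, it only takes a few lines against the API. Here’s a minimal sketch using Anthropic’s Python SDK; the model identifier is my assumption, so substitute whatever ID the docs actually list for Sonnet 4.6.

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    "The car wash is 40 metres from my home. "
    "I want to wash my car. Should I walk or drive there?"
)

# Model ID is an assumption -- swap in the real Sonnet 4.6 identifier.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=300,
    messages=[{"role": "user", "content": PROMPT}],
)
print(response.content[0].text)
```

Run it a handful of times; sampling means the failure isn’t deterministic, which is half the fun.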
What fascinates me about this – and also slightly concerns me – is the juxtaposition. These are the same models that can write complex code, analyse intricate data patterns, and engage in sophisticated reasoning about abstract concepts. Yet a straightforward logic problem involving basic spatial reasoning, and the small matter that the car itself needs to be there, trips them up. Someone in the discussion thread described it perfectly: LLMs are “a box of clever tricks.” They can produce superhuman output in one moment and fail at what a five-year-old would understand in the next.
This speaks to something I’ve been thinking about a lot as someone who works in IT and DevOps. We’re racing toward deploying these systems in critical applications – and companies are salivating at the prospect of reducing headcount – but we’re doing it with technology that has what researchers call “jagged intelligence.” Brilliant at some things, bewilderingly incompetent at others, and you can’t always predict which is which.
I saw quite a few comments from recent graduates worried about their job prospects, particularly in fields like statistics and cybersecurity. One person mentioned they’re graduating with a master’s in statistics and feeling obsolete before even getting started. I get that anxiety. The speed of change is genuinely unsettling. But I think there’s a middle ground between the doomers who think we’re all unemployed by next Tuesday and the optimists who think nothing will change at all.
The reality is probably messier and more interesting. Yes, these tools are getting more capable. Yes, they’ll change how we work. But that car wash question? That’s a reminder that we’re not quite at the “replace everyone” stage yet. We need people who can spot when the AI is confidently wrong, who understand context and nuance, who can verify and validate.
What does frustrate me, though – and this is me channeling my inner grumpy middle-aged man – is the hype cycle around each release. The breathless announcements, the benchmark wars, the tribal cheerleading for different models. Meanwhile, the actual important questions – like how we ensure these systems are reliable, how we retrain workers, how we manage the environmental footprint of training increasingly massive models – get drowned out by the noise.
And speaking of costs, someone mentioned they’re spending $100 a day on API calls. That’s not sustainable for most individuals or small businesses. The pricing models need work. There are cheaper alternatives emerging, which is good, but we need to think seriously about accessibility and who gets left behind in this AI arms race.
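To put that $100 in perspective, here’s some back-of-the-envelope arithmetic. The per-token prices below are assumptions for illustration; check your provider’s current pricing page before quoting anyone.

```python
# Rough API cost estimator. Prices are assumed, illustrative figures only.
INPUT_PRICE_PER_M = 3.00    # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens (assumed)

def daily_cost(calls: int, in_tokens: int, out_tokens: int) -> float:
    """Estimated daily spend for `calls` requests of the given average size."""
    input_cost = calls * in_tokens / 1e6 * INPUT_PRICE_PER_M
    output_cost = calls * out_tokens / 1e6 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost

# A modest agentic coding session: 500 calls, 8k tokens in, 1k out per call.
print(f"${daily_cost(500, 8_000, 1_000):.2f}")     # $19.50
# A heavy one: 2,000 calls with fatter contexts gets you to triple digits.
print(f"${daily_cost(2_000, 10_000, 1_500):.2f}")  # $105.00
```

At those assumed rates, the $100-a-day figure is entirely plausible for anyone running agents with large contexts all day.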
The timing of Anthropic’s release was interesting too – right as Elon’s Grok continues to generate controversy. Nothing quite like a well-timed product launch to shift the conversation. There’s something almost theatrical about how these companies jockey for position, each trying to claim the crown of “most capable model.”
Looking at the benchmarks and leaderboards, Sonnet 4.6 seems genuinely impressive on paper. But benchmarks are a bit like restaurant reviews – useful, but you don’t really know until you’re sitting at the table. Those early testers noting that Sonnet 4.6 feels “half-baked sometimes” despite strong raw performance numbers? That’s the real-world feedback that matters.
So where does this leave us? Well, AI continues its relentless march forward, stumbling over car wash logistics while conquering complex mathematical proofs. We’re living through a genuinely transformative period in technology, but transformation is rarely clean or predictable. It’s going to be messy, uneven, occasionally absurd, and yes, sometimes a bit frightening.
My advice? Keep learning, stay adaptable, and maybe don’t trust an AI to help you wash your car just yet. At least not without double-checking its logic first.