Gemma 4 Is Here, and the Local AI Scene Is Going Absolutely Feral
So I’ve been down a rabbit hole this Easter weekend, and it has nothing to do with chocolate eggs. Google DeepMind dropped Gemma 4, and the local AI community has basically lost its collective mind — in the best possible way.
For those not deep in the weeds on this stuff, Gemma is Google’s family of open-weights AI models. The new Gemma 4 lineup ranges from tiny models designed to run on phones all the way up to a 31 billion parameter beast that’ll give your home server a decent workout. And the specs are genuinely impressive: multimodal input spanning text, images, video, and audio; context windows up to 256K tokens; native tool calling; built-in reasoning modes; and support for over 140 languages. That last point is actually more significant than most people give it credit for — more on that in a moment.
What’s really got people excited though — and I’ll admit, me included — is the combination of capability and accessibility. The 26B MoE (Mixture-of-Experts) model only activates about 4 billion parameters at a time, which means it punches well above its weight in terms of what hardware you actually need to run it. Someone in the discussion I was following made a joke about running it on a Commodore 64, which got a good laugh, but the underlying point is real: these models are becoming genuinely runnable on hardware that regular people actually own. That’s a big deal.
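To make that "punches above its weight" point concrete, here's a back-of-the-envelope sketch. These are my own rough numbers, not anything Google has published: the parameter counts are just the headline figures, the bytes-per-weight values are the usual quantisation levels, and I'm ignoring KV cache and runtime overheads entirely. The key asymmetry with MoE is that memory still scales with *total* parameters (the whole model has to sit in RAM), while per-token compute scales with the *active* parameters.

```python
# Rough MoE sizing sketch -- illustrative arithmetic only, with
# overheads (KV cache, activations, runtime) deliberately ignored.

def weight_memory_gb(total_params_b: float, bytes_per_weight: float) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return total_params_b * 1e9 * bytes_per_weight / 1024**3

def relative_compute(active_params_b: float, total_params_b: float) -> float:
    """Fraction of a dense model's per-token FLOPs that the MoE needs."""
    return active_params_b / total_params_b

TOTAL_B = 26   # headline total parameter count of the MoE model (billions)
ACTIVE_B = 4   # roughly how many parameters fire per token (billions)

for name, bpw in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
    print(f"{name}: ~{weight_memory_gb(TOTAL_B, bpw):.0f} GiB of weights")

print(f"per-token compute vs dense 26B: ~{relative_compute(ACTIVE_B, TOTAL_B):.0%}")
```

So at a 4-bit quantisation you're looking at weights in the low teens of GiB, which is exactly the range where high-end consumer machines become viable, while the per-token compute is a small fraction of what a dense model the same size would burn. That's the whole trick.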
There’s also a licensing change that deserves more attention than it’s getting. Previous Gemma models shipped with Google’s own custom licence, which was technically “open” but had enough asterisks attached that enterprise users were understandably nervous. Gemma 4 ships under Apache 2.0 — the gold standard of permissive open-source licensing. You can use it commercially, modify it, redistribute it, basically do whatever you want with it. Google has gone from “open with fine print” to genuinely open, and that shifts the competitive landscape considerably. I suspect this decision will age very well for them.
The benchmarks tell an interesting story too. Someone compiled a comparison table against Qwen 3.5, which has been the darling of the local AI crowd for a while now, and the results are… competitive. Qwen edges ahead in several areas, particularly on the agentic tool-use benchmarks. But Gemma 4 holds its own, and in multilingual performance it’s genuinely strong. For those of us who work in environments that aren’t purely English — and frankly that’s most real-world enterprise deployments — that matters. A colleague of mine spent months trying to get a previous-generation model to handle some French documentation consistently. Watching these multilingual benchmarks improve release by release has been quietly satisfying.
The community reaction has been predictably chaotic and wonderful. Within hours of release, people were already posting about quantised versions, comparing performance at different compression levels, and debating which tiny model would work best for a local voice assistant setup. One thread went deep into using the E4B model — one of the smaller ones with native audio input — as a direct pipeline from speech to response, cutting out the separate transcription step entirely. That kind of architecture simplification has real practical implications for anyone building local voice-enabled systems.
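The architectural simplification is easiest to see side by side. This is a structural sketch only: every function body is a stub standing in for a real model call, and all the names are mine, invented for illustration — not any actual API.

```python
# Structural sketch of the two voice-assistant pipelines discussed above.
# All functions are stubs standing in for real model calls; the names
# are illustrative, not an actual library API.

def transcribe(audio: bytes) -> str:
    """Stub for a separate speech-to-text model (a Whisper-style ASR)."""
    return "<transcript of audio>"

def text_llm(prompt: str) -> str:
    """Stub for a text-only language model."""
    return f"<reply to: {prompt}>"

def audio_native_llm(audio: bytes) -> str:
    """Stub for a model with native audio input, like the E4B described above."""
    return "<reply generated directly from audio>"

def classic_pipeline(audio: bytes) -> str:
    # Two model calls and two sets of weights; the LLM only ever sees
    # the transcript, so tone, emphasis, and hesitation are lost at
    # the ASR step.
    return text_llm(transcribe(audio))

def direct_pipeline(audio: bytes) -> str:
    # One model call; the transcription stage disappears entirely.
    return audio_native_llm(audio)
```

Fewer moving parts means fewer things to keep loaded, one fewer hop of latency, and no lossy text intermediate between the user's voice and the model.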
I’ll be honest about my own situation here. My home setup isn’t exactly bleeding edge. I’ve been running some of the smaller quantised models on my Mac for a while now, mostly for coding assistance and document summarisation, and the improvement curve has been remarkable. Models that would have required server-grade hardware eighteen months ago now run comfortably on consumer machines. Gemma 4’s E2B apparently outperforms Gemma 3’s 27B model on most benchmarks. Let that sink in — a 2 billion parameter model beating a 27 billion parameter model from just one generation ago. The efficiency gains are genuinely extraordinary.
That said, I do keep one eye on the less comfortable side of all this progress. The environmental cost of training these models is substantial, even if the inference footprint of running them locally is relatively small. Google hasn’t published the training energy figures for Gemma 4, and I wish they would — not as a gotcha, but because transparency matters when we’re making societal decisions about how fast and how far to push this technology. The “democratisation” argument is real and I believe in it, but it shouldn’t be used to wave away legitimate questions about the resource cost of getting here.
There are also the usual jokes in every thread about immediately fine-tuning these models into uncensored variants with increasingly unhinged names. I get it, the humour lands, and there are legitimate reasons to want models without overly conservative safety guardrails for specific research or creative applications. But it’s worth occasionally acknowledging that the people working on model safety aren’t all corporate villains — some of them are genuinely thinking hard about real problems. The community’s reflexive contempt for anything safety-related is occasionally a bit exhausting, even when specific implementations deserve criticism.
Still. Gemma 4 running locally, under Apache 2.0, with multimodal capabilities and genuine reasoning chops, in sizes that fit on consumer hardware. That’s a good outcome. The pace of progress here is kind of staggering, and for once I’m spending my Easter being genuinely optimistic about where the local AI ecosystem is heading.
Now if someone could just port the E4B audio support to llama.cpp already, that would be ideal. I have plans.